spacekit.extractor.scrape
- class spacekit.extractor.scrape.Scraper(cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=True, name='Scraper', **log_kws)[source]
Bases: object
Parent class for various data scraping subclasses; instantiating the appropriate subclass is preferred.
Instantiates a spacekit.extractor.scrape.Scraper object.
- Parameters:
cache_dir (str, optional) – parent folder to save data, by default “~”
cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”
format (str, optional) – archive format type, by default “zip”
extract (bool, optional) – extract the contents of the compressed archive file, by default True
clean (bool, optional) – remove compressed file after extraction, by default True
name (str, optional) – logging name, by default “Scraper”
- class spacekit.extractor.scrape.FileScraper(search_path='', search_patterns=['*.zip'], cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=False, name='FileScraper', **log_kws)[source]
Bases: Scraper
Scraper subclass used to search and extract files on local disk that match regex/glob pattern(s).
- Parameters:
Scraper (spacekit.extractor.scrape.Scraper object) – parent Scraper class
Instantiates a spacekit.extractor.scrape.FileScraper object.
- Parameters:
search_path (str, optional) – top-level path to search through, by default “”
search_patterns (list, optional) – glob pattern strings, by default ["*.zip"]
cache_dir (str, optional) – parent folder to save data, by default “~”
cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”
format (str, optional) – archive format type, by default “zip”
extract (bool, optional) – extract the contents of the compressed archive file, by default True
clean (bool, optional) – remove compressed file after extraction, by default False
name (str, optional) – logging name, by default “FileScraper”
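A minimal usage sketch (assuming FileScraper exposes the same scrape() entry point documented for WebScraper below; the paths and patterns here are illustrative placeholders):

    from spacekit.extractor.scrape import FileScraper

    # Search ./data for compressed archives and extract them under ~/data
    fs = FileScraper(
        search_path="./data",
        search_patterns=["*.zip", "*.tgz"],
        cache_dir="~",
        cache_subdir="data",
        extract=True,
        clean=False,
    )
    fpaths = fs.scrape()  # assumed to return the matched/extracted file paths
    print(fpaths)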
- class spacekit.extractor.scrape.WebScraper(uri, dataset, hash_algorithm='md5', cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=True, **log_kws)[source]
Bases: Scraper
Scraper subclass for extracting publicly available data off the web.
- Parameters:
Scraper (class) – spacekit.extractor.scrape.Scraper object
Uses a dictionary of uri, filename, and hash key-value pairs to download data securely from a website such as GitHub.
- Parameters:
uri (string) – root uri (web address)
dataset (dictionary) – key-pair values of each dataset’s filenames and hash keys
hash_algorithm (str, optional) – type of hash key algorithm used, by default “md5”
cache_dir (str, optional) – parent folder to save data, by default “~”
cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”
format (str, optional) – archive format type, by default “zip”
extract (bool, optional) – extract the contents of the compressed archive file, by default True
clean (bool, optional) – remove compressed file after extraction, by default True
- scrape()[source]
Using the key-value pairs in the dataset dictionary attribute, download the files from a GitHub repo and check that the hash keys match before extracting. Extraction and hash-key checking are handled externally by the keras.utils.data_utils.get_file method. If extraction is successful, the archive file will be deleted.
- Returns:
paths to downloaded and extracted files
- Return type:
list
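A hedged example of the download flow described above. The uri, filenames, and hash digests are placeholders, and the exact schema of the dataset dictionary (here nested fname/hash keys) is an assumption drawn from this description, not a confirmed spacekit API contract:

    from spacekit.extractor.scrape import WebScraper

    uri = "https://github.com/alphasentaurii/spacekit/raw/main/datasets"  # placeholder
    dataset = {
        "2021-11-04": {"fname": "2021-11-04-1636048291.zip", "hash": "<sha256-digest>"},
        "2021-10-28": {"fname": "2021-10-28-1635457222.zip", "hash": "<sha256-digest>"},
    }
    ws = WebScraper(uri, dataset, hash_algorithm="sha256")
    fpaths = ws.scrape()  # download, verify hashes, extract, delete archives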
- class spacekit.extractor.scrape.S3Scraper(bucket, pfx='archive', dataset=None, cache_dir='~', cache_subdir='data', format='zip', extract=True, **log_kws)[source]
Bases: Scraper
Scraper subclass for extracting data from an AWS s3 bucket (requires AWS credentials with permission to access the bucket).
- Parameters:
Scraper (class) – spacekit.extractor.scrape.Scraper object
Instantiates a spacekit.extractor.scrape.S3Scraper object
- Parameters:
bucket (string) – s3 bucket name
pfx (str, optional) – aws bucket prefix (subfolder uri path), by default “archive”
dataset (dictionary, optional) – key-value pairs of dataset filenames and prefixes, by default None
cache_dir (str, optional) – parent folder to save data, by default “~”
cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”
format (str, optional) – archive format type, by default “zip”
extract (bool, optional) – extract the contents of the compressed archive file, by default True
- make_s3_keys(fnames=['2022-02-14-1644848448.zip', '2021-11-04-1636048291.zip', '2021-10-28-1635457222.zip'])[source]
Generates a dataset dictionary attribute containing the filename and uri-prefix key-value pairs.
- Parameters:
fnames (list, optional) – dataset archive file names, typically consisting of a hyphenated date and timestamp string from when the data was generated (automatically the case for saved spacekit.analyzer.compute.Computer objects), by default [“2022-02-14-1644848448.zip”, “2021-11-04-1636048291.zip”, “2021-10-28-1635457222.zip”]
- Returns:
key-value pairs of dataset archive filenames and their parent folder prefix name
- Return type:
dictionary
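A short sketch of the s3 workflow (the bucket name and prefix are hypothetical placeholders; AWS credentials with read access to the bucket must already be configured in the environment):

    from spacekit.extractor.scrape import S3Scraper

    s3 = S3Scraper("my-spacekit-bucket", pfx="archive")  # hypothetical bucket
    dataset = s3.make_s3_keys(fnames=["2021-11-04-1636048291.zip"])
    print(dataset)  # archive filename / prefix key-value pairs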
- class spacekit.extractor.scrape.JsonScraper(search_path=os.getcwd(), search_patterns=['*_total_*_svm_*.json'], file_basename='svm_data', crpt=0, save_csv=False, store_h5=True, h5_file=None, output_path=None, **log_kws)[source]
Searches local files using glob pattern(s) to scrape JSON file data. Optionally stores the data in an h5 file (default) and/or a CSV file; the JSON harvester method returns a Pandas DataFrame. This class can also be used to load an h5 file. CREDIT: the majority of the code here was repurposed into a class object from Drizzlepac.hap_utils.json_harvester. Multiple customizations were needed for specific machine learning preprocessing outside the scope of Drizzlepac's primary intended use cases, which is why the code now lives here in a stripped-down version instead of being submitted as a PR to the original repo. That, and the need to avoid including Drizzlepac as a dependency for spacekit, since spacekit is meant to be used for testing Drizzlepac's SVM processing.
- Parameters:
FileScraper (spacekit.extractor.scrape.FileScraper) – parent FileScraper class
Initializes a JsonScraper class object
- Parameters:
search_path (str, optional) – The full path of the directory that will be searched for json files to process, by default os.getcwd()
search_patterns (list, optional) – list of glob patterns to use for search, by default ["*_total_*_svm_*.json"]
file_basename (str, optional) – Name of the output file basename (filename without the extension) for the Hierarchical Data Format version 5 (HDF5) .h5 file that the DataFrame will be written to, by default “svm_data”
crpt (int, optional) – Uses extended dataframe index name to differentiate from normal svm data, by default 0
save_csv (bool, optional) – store h5 data into a CSV file, by default False
store_h5 (bool, optional) – save data in hdf5 format, by default True
h5_file (str or path, optional) – load from a saved hdf5 file on local disk, by default None
output_path (str or path, optional) – where to save the data, by default None
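A typical harvest-and-store flow using the methods documented below (paths are illustrative):

    from spacekit.extractor.scrape import JsonScraper

    jsc = JsonScraper(
        search_path=".",
        search_patterns=["*_total_*_svm_*.json"],
        file_basename="svm_data",
        store_h5=True,
        save_csv=False,
        output_path=".",
    )
    df = jsc.json_harvester()  # scrape matching json files into one DataFrame
    h5_path = jsc.h5store()    # write the DataFrame to an HDF5 (.h5) file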
- flatten_dict(dd, separator='.', prefix='')[source]
Recursive subroutine to flatten nested dictionaries down into a single-layer dictionary. Borrowed from Drizzlepac, which borrowed it from: https://www.geeksforgeeks.org/python-convert-nested-dictionary-into-flattened-dictionary/
- Parameters:
dd (dict) – nested dictionary to flatten
separator (str, optional) – string used to join parent and child key names, by default “.”
prefix (str, optional) – key-name prefix carried through recursive calls, by default “”
- Returns:
a version of input dictionary dd that has been flattened by one layer
- Return type:
dictionary
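A minimal standalone sketch of the flattening approach described above (not the exact Drizzlepac-derived implementation):

    def flatten_dict(dd, separator=".", prefix=""):
        # Recursively join nested keys with `separator`:
        # {"a": {"b": 1}} -> {"a.b": 1}
        if not isinstance(dd, dict):
            return {prefix: dd}
        flat = {}
        for key, value in dd.items():
            new_key = f"{prefix}{separator}{key}" if prefix else str(key)
            flat.update(flatten_dict(value, separator=separator, prefix=new_key))
        return flat

    print(flatten_dict({"gen_info": {"telescope": "hst", "proposal": 1234}}))
    # {'gen_info.telescope': 'hst', 'gen_info.proposal': 1234}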
- get_json_files()[source]
Uses glob to create a list of json files to harvest. This function looks for all the json files containing qa test results generated by runastrodriz and runsinglehap. The search starts in the directory specified by the search_path parameter, but will also look in immediate sub-directories if no json files are located in the directory specified by search_path.
- Returns:
out_json_dict containing lists of all identified json files, grouped and keyed by Pandas DataFrame index value.
- Return type:
ordered dictionary
- h5store(**kwargs)[source]
Store a Pandas DataFrame to an HDF5 file on local disk.
- Returns:
path to stored h5 file
- Return type:
string
- json_harvester()[source]
Main calling function to harvest json files matching the search pattern and store the results in dictionaries, which are then combined into a single dataframe.
- Returns:
dataset created by scraping data from json files on local disk.
- Return type:
dataframe
- load_h5_file()[source]
Loads a dataframe from an H5 file on local disk.
- Returns:
data loaded from an H5 file and stored in a dataframe object attribute.
- Return type:
dataframe
- Raises:
Exception – Requested file not found
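A brief reload example (the file path is a placeholder and must point to an existing .h5 file):

    from spacekit.extractor.scrape import JsonScraper

    jsc = JsonScraper(h5_file="svm_data.h5")
    df = jsc.load_h5_file()  # raises an Exception if the file is not found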
- make_dataframe_line(json_filename_list)[source]
Extracts information from the json files specified by the input list json_filename_list. The main difference between this and the original Drizzlepac source code is a much more limited collection of data: descriptions and units are not collected; only a handful of specific keyword values are scraped from the general information and header extensions.
- Parameters:
json_filename_list (list) – list of json files to process
- Returns:
ingest_dict – ordered dictionary containing all information extracted from json files specified by the input list json_filename_list.
- Return type:
ordered dictionary
- read_json_file(json_filename)[source]
Extracts header and data sections from the specified json file and returns the header and data (in their original pre-json format) as a nested ordered dictionary.
Supported output data types:
all basic single-value python data types (float, int, string, Boolean, etc.)
lists
simple key-value dictionaries and ordered dictionaries
multi-layer nested dictionaries and ordered dictionaries
tuples
numpy arrays
astropy tables
- Parameters:
json_filename (str) – Name of the json file to extract data from
- Returns:
out_dict structured similarly to self.out_dict with separate ‘header’ and ‘data’ keys. The information stored in the ‘data’ section will be in the same format that it was in before it was serialized and stored as a json file.
- Return type:
dictionary
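A single-file read sketch ("diagnostic.json" is a placeholder for one of the qa json files produced by runastrodriz or runsinglehap):

    from spacekit.extractor.scrape import JsonScraper

    jsc = JsonScraper(search_path=".")
    out_dict = jsc.read_json_file("diagnostic.json")
    header, data = out_dict["header"], out_dict["data"]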