spacekit.extractor.scrape

Inheritance diagram of spacekit.extractor.scrape

class spacekit.extractor.scrape.Scraper(cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=True, name='Scraper', **log_kws)[source]

Bases: object

Parent class for the various data scraping subclasses. Instantiating the appropriate subclass is preferred.

Instantiates a spacekit.extractor.scrape.Scraper object.

Parameters:
  • cache_dir (str, optional) – parent folder to save data, by default “~”

  • cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”

  • format (str, optional) – archive format type, by default “zip”

  • extract (bool, optional) – extract the contents of the compressed archive file, by default True

  • clean (bool, optional) – remove the compressed archive file after extraction, by default True

  • name (str, optional) – logging name, by default “Scraper”

check_cache(cache)[source]
compress_files(target_folder, fname=None, compression='zip')[source]
extract_archives()[source]

Extract the contents of the compressed archive file(s).

TODO: extract other archive types (.tar, .tgz)

Returns:

paths to downloaded and extracted dataset files

Return type:

list
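
Example: compress_files and check_cache are listed above without docstrings. The snippet below is only a sketch of how compress_files might be called, inferred from its signature; the target folder is a hypothetical placeholder, and since instantiating a subclass is preferred, FileScraper is used as the concrete class.

from spacekit.extractor.scrape import FileScraper

# compress_files is inherited from the Scraper base class, so any subclass
# instance can call it. The target folder below is a placeholder.
scraper = FileScraper(cache_dir="~", cache_subdir="data")
scraper.compress_files("results/2021-11-04", compression="zip")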

class spacekit.extractor.scrape.FileScraper(search_path='', search_patterns=['*.zip'], cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=False, name='FileScraper', **log_kws)[source]

Bases: Scraper

Scraper subclass used to search for and extract files on the local disk that match regex/glob pattern(s).

Parameters:

Scraper (spacekit.extractor.scrape.Scraper object) – parent Scraper class

Instantiates a spacekit.extractor.scrape.FileScraper object.

Parameters:
  • search_path (str, optional) – top-level path to search through, by default “”

  • search_patterns (list, optional) – glob pattern strings, by default ["*.zip"]

  • cache_dir (str, optional) – parent folder to save data, by default “~”

  • cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”

  • format (str, optional) – archive format type, by default “zip”

  • extract (bool, optional) – extract the contents of the compressed archive file, by default True

  • clean (bool, optional) – remove compressed file after extraction, by default False

  • name (str, optional) – logging name, by default “FileScraper”

scrape()[source]

Searches the local disk for files matching the glob/regex pattern(s).

Returns:

paths to dataset files found in glob pattern search

Return type:

list
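
Example: a minimal usage sketch; the search path is a placeholder.

from spacekit.extractor.scrape import FileScraper

fs = FileScraper(
    search_path="path/to/archives",   # placeholder location
    search_patterns=["*.zip"],
    cache_dir="~",
    cache_subdir="data",
    extract=True,
    clean=False,   # keep the original .zip files after extraction
)
fpaths = fs.scrape()   # list of dataset file paths found by the glob search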

class spacekit.extractor.scrape.WebScraper(uri, dataset, hash_algorithm='md5', cache_dir='~', cache_subdir='data', format='zip', extract=True, clean=True, **log_kws)[source]

Bases: Scraper

Scraper subclass for extracting publicly available data off the web.

Parameters:

Scraper (class) – spacekit.extractor.scrape.Scraper object

Uses the root uri together with a dictionary of filename and hash key-value pairs to download data securely from a website such as GitHub.

Parameters:
  • uri (string) – root uri (web address)

  • dataset (dictionary) – key-value pairs of each dataset’s filenames and hash keys

  • hash_algorithm (str, optional) – type of hash key algorithm used, by default “md5”

  • cache_dir (str, optional) – parent folder to save data, by default “~”

  • cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”

  • format (str, optional) – archive format type, by default “zip”

  • extract (bool, optional) – extract the contents of the compressed archive file, by default True

  • clean (bool, optional) – remove the compressed file after extraction, by default True

scrape()[source]

Using the key-value pairs in the dataset dictionary attribute, downloads the files from a GitHub repo and verifies that the hash keys match before extracting. Extraction and hash-key checking are handled externally by the keras.utils.data_utils.get_file method. If extraction is successful, the archive file is deleted.

Returns:

paths to downloaded and extracted files

Return type:

list
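
Example: a hedged usage sketch. The uri and, in particular, the internal layout of the dataset dictionary are illustrative assumptions; the docstring above only states that the dictionary pairs each dataset’s filename with its hash key.

from spacekit.extractor.scrape import WebScraper

uri = "https://github.com/example-org/example-repo/raw/main/data"   # placeholder
dataset = {
    # illustrative layout; see the `dataset` parameter above
    "2021-11-04": {
        "fname": "2021-11-04-1636048291.zip",
        "hash": "<md5-or-sha256-digest>",   # placeholder digest
    },
}
ws = WebScraper(uri, dataset, hash_algorithm="md5", cache_dir="~", cache_subdir="data")
fpaths = ws.scrape()   # downloads, verifies hashes, extracts, then removes the archives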

class spacekit.extractor.scrape.S3Scraper(bucket, pfx='archive', dataset=None, cache_dir='~', cache_subdir='data', format='zip', extract=True, **log_kws)[source]

Bases: Scraper

Scraper subclass for extracting data from an AWS s3 bucket (requires AWS credentials with permission to access the bucket).

Parameters:

Scraper (class) – spacekit.extractor.scrape.Scraper object

Instantiates a spacekit.extractor.scrape.S3Scraper object

Parameters:
  • bucket (string) – s3 bucket name

  • pfx (str, optional) – aws bucket prefix (subfolder uri path), by default “archive”

  • dataset (dictionary, optional) – key-value pairs of dataset filenames and prefixes, by default None

  • cache_dir (str, optional) – parent folder to save data, by default “~”

  • cache_subdir (str, optional) – save data in a subfolder one directory below cache_dir, by default “data”

  • format (str, optional) – archive format type, by default “zip”

  • extract (bool, optional) – extract the contents of the compressed archive file, by default True

authorize_aws()[source]
import_dataset()[source]

Imports a job metadata file from the s3 bucket.

make_s3_keys(fnames=['2022-02-14-1644848448.zip', '2021-11-04-1636048291.zip', '2021-10-28-1635457222.zip'])[source]

Generates a dataset dictionary attribute containing the filename and uri-prefix key-value pairs.

Parameters:

fnames (list, optional) – dataset archive file names, typically consisting of a hyphenated date and timestamp string from when the data was generated (automatically the case for saved spacekit.analyzer.compute.Computer objects), by default [ “2022-02-14-1644848448.zip”, “2021-11-04-1636048291.zip”, “2021-10-28-1635457222.zip” ]

Returns:

key-value pairs of dataset archive filenames and their parent folder prefix name

Return type:

dict

static s3_download(keys, bucket_name, prefix)[source]
static s3_upload(keys, bucket_name, prefix)[source]
scrape()[source]

Downloads files from s3 using the configured boto3 client. Calls the extract_archives method to automatically extract the file contents if the object’s extract attribute is set to True.

Returns:

paths to downloaded and extracted files

Return type:

list

scrape_s3_file(fpath, obj)[source]
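
Example: a usage sketch. The bucket name is a placeholder, and valid AWS credentials with read access to the bucket are assumed; whether scrape() handles authorization internally or authorize_aws() must be called first is not documented here, so the explicit call is shown commented out.

from spacekit.extractor.scrape import S3Scraper

s3 = S3Scraper("my-spacekit-bucket", pfx="archive", cache_dir="~", cache_subdir="data")
# s3.authorize_aws()   # may be required before downloading (not documented above)
s3.make_s3_keys(fnames=["2021-11-04-1636048291.zip"])   # builds the dataset dict attribute
fpaths = s3.scrape()   # downloads via boto3; extracts contents when extract=True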

class spacekit.extractor.scrape.JsonScraper(search_path=os.getcwd(), search_patterns=['*_total_*_svm_*.json'], file_basename='svm_data', crpt=0, save_csv=False, store_h5=True, h5_file=None, output_path=None, **log_kws)[source]

Searches local files using glob pattern(s) to scrape JSON file data. The data can optionally be stored in an h5 file (the default) and/or a CSV file, and the JSON harvester method returns a Pandas dataframe. This class can also be used to load a previously saved h5 file; a usage sketch follows the parameter list below. CREDIT: the majority of this code was repurposed into a class object from Drizzlepac.hap_utils.json_harvester. Multiple customizations were needed for machine-learning preprocessing that falls outside the scope of Drizzlepac’s primary use-cases, which is why a stripped-down version lives here rather than being submitted as a PR to the original repo; keeping it here also avoids adding Drizzlepac as a dependency of spacekit, since spacekit is meant to be used for testing Drizzlepac’s SVM processing.

Parameters:

FileScraper (spacekit.extractor.scrape.FileScraper) – parent FileScraper class

Initializes a JsonScraper class object

Parameters:
  • search_path (str, optional) – The full path of the directory that will be searched for json files to process, by default os.getcwd()

  • search_patterns (list, optional) – list of glob patterns to use for search, by default [“*_total_*_svm_*.json”]

  • file_basename (str, optional) – Name of the output file basename (filename without the extension) for the Hierarchical Data Format version 5 (HDF5) .h5 file that the DataFrame will be written to, by default “svm_data”

  • crpt (int, optional) – Uses extended dataframe index name to differentiate from normal svm data, by default 0

  • save_csv (bool, optional) – store h5 data into a CSV file, by default False

  • store_h5 (bool, optional) – save data in hdf5 format, by default True

  • h5_file (str or path, optional) – load from a saved hdf5 file on local disk, by default None

  • output_path (str or path, optional) – where to save the data, by default None
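
Example: a sketch of the typical flow implied by the methods documented below. Paths are placeholders, and whether json_harvester already writes the HDF5/CSV output when store_h5 or save_csv is set is not stated here, so the explicit calls are shown for illustration.

from spacekit.extractor.scrape import JsonScraper

js = JsonScraper(
    search_path="path/to/svm/results",   # placeholder
    search_patterns=["*_total_*_svm_*.json"],
    file_basename="svm_data",
    store_h5=True,
    save_csv=True,
    output_path="path/to/output",        # placeholder
)
df = js.json_harvester()   # pandas dataframe scraped from the json files
h5_path = js.h5store()     # store the dataframe in HDF5 format (see h5store below)
js.write_to_csv()          # optional csv output (see write_to_csv below)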

flatten_dict(dd, separator='.', prefix='')[source]

Recursive subroutine to flatten nested dictionaries down into a single-layer dictionary. Borrowed from Drizzlepac, which borrowed it from: https://www.geeksforgeeks.org/python-convert-nested-dictionary-into-flattened-dictionary/

Parameters:
  • dd (dict) – dictionary to flatten

  • separator (str, optional) – separator character used in constructing flattened dictionary key names from multiple recursive elements. Default value is ‘.’

  • prefix (str, optional) – flattened dictionary key prefix. Default value is an empty string (‘’).

Returns:

a version of input dictionary dd that has been flattened by one layer

Return type:

dictionary
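
Example: an illustrative sketch of the flattening behavior described above, assuming keys are joined with the default ‘.’ separator, that the recursion produces a fully flattened result, and that constructing a JsonScraper with defaults does not itself trigger a file search.

from spacekit.extractor.scrape import JsonScraper

js = JsonScraper()
nested = {"header": {"instrument": "WFC3", "filters": {"f1": "F606W"}}, "n_sources": 42}
flat = js.flatten_dict(nested)
# expected shape of the result (assuming full flattening):
# {"header.instrument": "WFC3", "header.filters.f1": "F606W", "n_sources": 42}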

get_json_files()[source]

Uses glob to create a list of json files to harvest. This function looks for all the json files containing qa test results generated by runastrodriz and runsinglehap. The search starts in the directory specified in the search_path parameter, but will look in immediate sub-directories as well if no json files are located in the directory specified by search_path.

Returns:

out_json_dict containing lists of all identified json files, grouped by and keyed by Pandas DataFrame index value.

Return type:

ordered dictionary

h5store(**kwargs)[source]

Stores the pandas DataFrame to an HDF5 file on local disk.

Returns:

path to stored h5 file

Return type:

string

json_harvester()[source]

Main calling function to harvest json files matching the search pattern(s) and store them in dictionaries, which are then combined into a single dataframe.

Returns:

dataset created by scraping data from json files on local disk.

Return type:

dataframe

load_h5_file()[source]

Loads a dataframe from an H5 file on local disk.

Returns:

data loaded from an H5 file and stored in a dataframe object attribute.

Return type:

dataframe

Raises:

Exception – Requested file not found
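
Example: a minimal sketch of reloading a previously stored dataframe; the .h5 path is a placeholder.

from spacekit.extractor.scrape import JsonScraper

js = JsonScraper(h5_file="path/to/svm_data.h5")   # placeholder path
df = js.load_h5_file()   # dataframe loaded from the H5 file; raises if the file is not found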

make_dataframe_line(json_filename_list)[source]

Extracts information from the json files specified by the input list json_filename_list. Main difference between this and the original Drizzlepac source code is a much more limited collection of data: descriptions and units are not collected; only a handful of specific keyword values are scraped from general information and header extensions.

Parameters:

json_filename_list (list) – list of json files to process

Returns:

ingest_dict – ordered dictionary containing all information extracted from json files specified by the input list json_filename_list.

Return type:

collections.OrderedDict

read_json_file(json_filename)[source]

Extracts the header and data sections from the specified json file and returns them (in their original pre-json format) as a nested ordered dictionary.

Supported output data types:

  • all basic single-value python data types (float, int, string, Boolean, etc.)

  • lists

  • simple key-value dictionaries and ordered dictionaries

  • multi-layer nested dictionaries and ordered dictionaries

  • tuples

  • numpy arrays

  • astropy tables

Parameters:

json_filename (str) – Name of the json file to extract data from

Returns:

out_dict structured similarly to self.out_dict with separate ‘header’ and ‘data’ keys. The information stored in the ‘data’ section will be in the same format that it was in before it was serialized and stored as a json file.

Return type:

dictionary

write_to_csv()[source]

Optionally writes the dataframe out to a .csv file.