spacekit.preprocessor.scrub

class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)[source]

class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)[source]

Class for invoking standard preprocessing steps of Single Visit Mosaic regression test data.

Parameters:

input_path (str or Path) – path to directory containing data input files
data (dataframe, optional) – dataframe containing raw inputs scraped from json (QA) files, by default None
output_path (str or Path, optional) – location to save preprocessed output files, by default None
output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”
dropnans (bool, optional) – find and remove any NaNs, by default True
save_raw (bool, optional) – save data as csv before any encoding is performed, by default True
make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True
crpt (int, optional) – dataset contains synthetically corrupted data, by default 0
make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False

add_crpt_labels()[source]

For new synthetic datasets, adds “label” target column and assigns value of 1 to all rows.

Returns: self.df updated with label column (all values set = 1)
Return type: dataframe

find_subsamples()[source]: Gets a varied sampling of dataframe observations and saves to local text file. This is one way of identifying a small subset for synthetic data generation.

make_pos_label_list()[source]: Looks for target class labels in dataframe and saves a text file listing index names of positive class. Originally this was to automate moving images into class labeled directories.
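As a rough illustration of the idea behind make_pos_label_list, the sketch below writes the names of positive-class (label=1) datasets to a text file. It is not spacekit's implementation: the function name `write_pos_label_list`, the plain dict standing in for the dataframe index and "label" column, and the one-name-per-line file layout are all assumptions made for this example.

```python
from pathlib import Path


def write_pos_label_list(labels, output_file):
    """Write names whose label is 1 to a text file, one per line.

    `labels` is a hypothetical stand-in for a dataframe: it maps each
    index name (dataset) to its target class label (0 or 1).
    """
    positives = [name for name, label in labels.items() if label == 1]
    # One name per line; the file is empty if there are no positives.
    Path(output_file).write_text("".join(f"{name}\n" for name in positives))
    return positives
```

Returning the list as well as writing it makes the helper easy to check without re-reading the file.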

preprocess_data()[source]: Main calling function to run each preprocessing step for SVM regression data.

scrub_columns()[source]: Initial dataframe scrubbing to extract and rename columns, drop NaNs, and set the index.

scrub_qa_summary(csvfile='single_visit_mosaics*.csv', idx=0)[source]: Alternative for when no .json files are available (the QA step was not run during processing).

class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, mode='fits', **log_kws)[source]

Class for invoking initial preprocessing of JWST calibration input data. Subclass of spacekit.preprocessor.scrub.Scrubber.

Initializes a JwstCalScrubber class object.

Parameters:

input_path (str or Path) – path on local disk where L1 input exposures are located
data (pd.DataFrame, optional) – dataframe of exposures to be preprocessed, by default None
pfx (str, optional) – limit scrape search to files starting with a given prefix such as ‘jw01018’, by default “”
sfx (str, optional) – limit scrape search to files ending with a given suffix, by default “_uncal.fits”
dropnans (bool, optional) – drop null value columns, by default False
save_raw (bool, optional) – save a copy of the dataframe before encoding, by default True
encoding_pairs (dict, optional) – preset key-value pairs for encoding categorical data, by default None
mode (str, optional) – determines how data is scraped and handled (‘fits’ for files or ‘df’ for dataframe), by default ‘fits’

fake_target_ids()[source]

Assigns a fake target ID using TARGNAME, TARG_RA or GS_MAG. These IDs are fake in that they’re unlikely to match actual target IDs assigned later in the pipeline. For source-based exposures, the id is always “s00001”.

Grouping logic:

- TARG_RA (rounded to 6 decimals): VISITYPE = PRIME_TARGETED_FIXED, TARGNAME = NaN
- TARGNAME: VISITYPE != PRIME_TARGETED_FIXED, TARGNAME != NaN
- GS_MAG: TARGNAME = NaN, GS_MAG != NaN, VISITYPE is neither PRIME_TARGETED_FIXED nor PARALLEL_PURE

Remaining groups not matching above parameters default to ‘t0’ (typically ‘parallel_pure’ visitypes).
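The grouping rules above can be sketched in plain Python. This is a minimal illustration that follows the rules literally as documented here, not spacekit's actual code. The function names, the use of dicts in place of dataframe rows, the sequential "t1", "t2", ... ID format, and numbering across all exposures rather than per program are assumptions; only the ‘t0’ fallback comes from the description above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, like a NaN dataframe cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def target_group_key(exp):
    """Return the (field, value) pair used to group one exposure,
    or None when the exposure falls through to the 't0' default."""
    visitype = exp.get("VISITYPE")
    targname = exp.get("TARGNAME")
    fixed = visitype == "PRIME_TARGETED_FIXED"
    if fixed and is_missing(targname):
        ra = exp.get("TARG_RA")
        # Assumption: an exposure with no TARG_RA cannot be grouped by it.
        return None if is_missing(ra) else ("TARG_RA", round(ra, 6))
    if not fixed and not is_missing(targname):
        return ("TARGNAME", targname)
    gs_mag = exp.get("GS_MAG")
    if (is_missing(targname) and not is_missing(gs_mag)
            and visitype not in ("PRIME_TARGETED_FIXED", "PARALLEL_PURE")):
        return ("GS_MAG", gs_mag)
    return None


def assign_fake_target_ids(exposures):
    """Map each exposure name to a fake target ID (hypothetical format)."""
    ids, seen = {}, {}
    for name, exp in exposures.items():
        key = target_group_key(exp)
        if key is None:
            ids[name] = "t0"
            continue
        if key not in seen:
            seen[key] = f"t{len(seen) + 1}"
        ids[name] = seen[key]
    return ids
```

Exposures that share the same grouping value receive the same ID, so two PRIME_TARGETED_FIXED exposures whose TARG_RA values agree to 6 decimals end up in one group.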

get_dtype_keys()[source]: Group input metadata into pre-set data types before applying NaNdlers.

Returns: key-value pairs of data type and exposure header / column name
Return type: dict

get_level3_products()[source]: Determines potential L3 products based on groups of input exposures with matching FITS keywords prog+obs+optelem+fxd_slit+subarray. These groups are further subdivided and assigned a fake target ID by TARGNAME, GS_MAG or TARG_RA.
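The first grouping step can be pictured as bucketing exposures by a tuple of matching keyword values. The sketch below is illustrative only and does not reproduce spacekit's implementation: `group_exposures` is a hypothetical helper, and the lowercase key names are taken from the shorthand in the description above rather than from actual header keywords.

```python
from collections import defaultdict

# Key names follow the shorthand used in the description (assumed, not
# verified against the real exposure header keywords).
GROUP_KEYS = ("prog", "obs", "optelem", "fxd_slit", "subarray")


def group_exposures(exposures, keys=GROUP_KEYS):
    """Group exposure names by the tuple of their values for `keys`.

    `exposures` maps an L1 exposure name to a dict of header values.
    Missing keys are treated as None so they still group together.
    """
    groups = defaultdict(list)
    for name, header in exposures.items():
        group_id = tuple(header.get(k) for k in keys)
        groups[group_id].append(name)
    return dict(groups)
```

Each resulting group would then be subdivided by fake target ID before a product name is built.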

property input_data: Preprocessed input data grouped by exposure type

Returns: input data grouped by exp_type (IMAGE, SPEC, FGS, TAC)
Return type: dict

make_image_product_name(k, v, tnum)[source]: Parse through exposure metadata to create expected L3 image products.

Parameters:

k (str) – exposure header key (L1 exposure name)
v (dict) – exposure header data
tnum (str) – number assigned to each unique target (targ_ra) within a program

make_spec_product_name(k, v, tnum)[source]: Parse through exposure metadata to create expected L3 spectroscopy products.

NOTE: Although the pipeline would create multiple products for either source-based exposures or (channel-based) MIRI MRS exposures, only one product name will be created since the model is concerned with RAM, i.e. how large the memory footprint is to calibrate a set of input exposures. Source-based products use “s00001” for the source; MIR_MRS exposures default to “ch4” for channel.

Parameters:

k (str) – exposure header key (L1 exposure name)
v (dict) – exposure header data
tnum (str) – number assigned to each unique target (targ_ra) within a program

make_tac_product_name(k, v, p)[source]: If an image or spec product meets the required conditions, it is added instead to the TAC products dictionary (Time-series, AMI, Coronagraph).

Parameters:

k (str) – exposure header key (L1 exposure name)
v (dict) – exposure header data
p (str) – product name

pixel_offsets()[source]: Generate the pixel offset between exposure reference pixels and the estimated L3 fiducial.

scrape_inputs()[source]: Scrape input exposure header metadata from fits files on local disk located at self.input_path.

scrub_inputs(exp_type='IMAGE')[source]

Main calling function for preprocessing input exposures of a given exposure type.

Parameters:

exp_type (str, optional) – Exposure type, by default “IMAGE”

Returns: preprocessed data with renamed columns, NaNs scrubbed and categorical data encoded
Return type: pd.DataFrame

verify_target_groups()[source]: Certain L3 products need to be further defined by their L1 input TARG_RA values in addition to all other parameters. This only affects PRIME_TARGETED_FIXED visit types where TARGNAME != NaN. If multiple unique TARG_RA/DEC values (rounded to 6 digits) are identified within the group of exposures, we can assume each TARG grouping is a unique L3 product.
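The check described above amounts to rounding each exposure's target coordinates and counting the distinct results. The following is a minimal sketch under stated assumptions, not the library's code: `split_by_target_position` is a hypothetical name, exposures are plain dicts, and the declination key is assumed to be TARG_DEC.

```python
def split_by_target_position(exposures, ndigits=6):
    """Split one group of exposures into sub-groups that share the same
    (TARG_RA, TARG_DEC) after rounding to `ndigits` decimal places.

    More than one sub-group suggests the exposures belong to more than
    one L3 product.
    """
    positions = {}
    for name, header in exposures.items():
        pos = (round(header["TARG_RA"], ndigits),
               round(header["TARG_DEC"], ndigits))
        positions.setdefault(pos, []).append(name)
    # Dicts preserve insertion order, so sub-groups appear in the order
    # their first exposure was seen.
    return list(positions.values())
```

Rounding before comparison keeps sub-arcsecond pointing jitter from splitting exposures of the same target into separate groups.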

class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)[source]

Base class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done via its subclasses.

save_csv_file(df=None, pfx='', index_col='index')[source]

Saves dataframe to csv file on local disk.

Parameters: pfx (str, optional) – Insert a prefix at start of filename, by default “”
Returns: self.data_path where file is saved on disk.
Return type: str