
class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)[source]
class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)[source]

Class for invoking standard preprocessing steps of Single Visit Mosaic regression test data.

  • input_path (str or Path) – path to directory containing data input files

  • data (dataframe, optional) – dataframe containing raw inputs scraped from json (QA) files, by default None

  • output_path (str or Path, optional) – location to save preprocessed output files, by default None

  • output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”

  • dropnans (bool, optional) – find and remove any NaNs, by default True

  • save_raw (bool, optional) – save data as csv before any encoding is performed, by default True

  • make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True

  • crpt (int, optional) – dataset contains synthetically corrupted data, by default 0

  • make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False


For new synthetic datasets, adds “label” target column and assigns value of 1 to all rows.


self.df updated with label column (all values set to 1)

Return type:



Gets a varied sampling of dataframe observations and saves to local text file. This is one way of identifying a small subset for synthetic data generation.


Looks for target class labels in dataframe and saves a text file listing index names of positive class. Originally this was to automate moving images into class labeled directories.
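
The idea of listing positive-class index names can be sketched with the standard library alone. This is only an illustration under stated assumptions, not spacekit's implementation: the `labels` mapping stands in for the dataframe index and its label column, and the `write_pos_list` helper and `pos.txt` filename are hypothetical.

```python
import os
import tempfile


def write_pos_list(labels, out_path):
    """Write the index names of positive-class (label == 1) rows, one per line."""
    positives = [name for name, label in labels.items() if label == 1]
    with open(out_path, "w") as f:
        f.write("\n".join(positives))
    return positives


# Hypothetical index names mapped to alignment labels (1 = misaligned)
labels = {"hst_a_visit01": 0, "hst_b_visit02": 1, "hst_c_visit03": 1}
out_path = os.path.join(tempfile.mkdtemp(), "pos.txt")
positives = write_pos_list(labels, out_path)
```

A plain text file with one index name per line is easy to feed into later steps, such as moving images into class-labeled directories.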


Main calling function to run each preprocessing step for SVM regression data.


Initial dataframe scrubbing to extract and rename columns, drop NaNs, and set the index.

scrub_qa_summary(csvfile='single_visit_mosaics*.csv', idx=0)[source]

Alternative if no .json files available (QA step not run during processing)

class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, mode='fits', **log_kws)[source]

Class for invoking initial preprocessing of JWST calibration input data.

  • Scrubber (class) – spacekit.preprocessor.scrub.Scrubber parent class

Initializes a JwstCalScrubber class object.

  • input_path (str or path) – path on local disk where L1 input exposures are located

  • data (pd.DataFrame, optional) – dataframe of exposures to be preprocessed, by default None

  • pfx (str, optional) – limit scrape search to files starting with a given prefix such as ‘jw01018’, by default “”

  • sfx (str, optional) – limit scrape search to files ending with a given suffix, by default “_uncal.fits”

  • dropnans (bool, optional) – drop null value columns, by default False

  • save_raw (bool, optional) – save a copy of the dataframe before encoding, by default True

  • encoding_pairs (dict, optional) – preset key-value pairs for encoding categorical data, by default None

  • mode (str, optional) – determines how data is scraped and handled (‘fits’ for files or ‘df’ for dataframe), by default ‘fits’
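
To illustrate how the pfx and sfx parameters narrow a file search, here is a minimal sketch in plain Python. It assumes simple starts-with and ends-with matching on filenames; the real scrape may work differently (for example via glob patterns), and both the `filter_exposures` helper and the filenames are hypothetical.

```python
def filter_exposures(filenames, pfx="", sfx="_uncal.fits"):
    """Keep only filenames that start with pfx and end with sfx."""
    return sorted(f for f in filenames if f.startswith(pfx) and f.endswith(sfx))


# Hypothetical directory listing
filenames = [
    "jw01018001001_02101_00001_nrca1_uncal.fits",
    "jw01018001001_02101_00001_nrca1_rate.fits",
    "jw02732001001_02101_00001_mirimage_uncal.fits",
]

# Restrict the search to one program's uncalibrated exposures
selected = filter_exposures(filenames, pfx="jw01018")
```

With the default empty prefix, every file ending in the suffix is kept, which matches the documented defaults.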


Assigns a fake target ID using TARGNAME, TARG_RA or GS_MAG. These IDs are fake in that they’re unlikely to match actual target IDs assigned later in the pipeline. For source-based exposures, the id is always “s00001”.


Remaining groups not matching above parameters default to ‘t0’ (typically ‘parallel_pure’ visitypes).


Group input metadata into pre-set data types before applying NaNdlers.

Returns:

key-value pairs of data type and exposure header / column name

Return type:

dict


Determines potential L3 products based on groups of input exposures with matching FITS keywords prog+obs+optelem+fxd_slit+subarray. These groups are further subdivided and assigned a fake target ID by TARGNAME, GS_MAG or TARG_RA.
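
The grouping step described above can be sketched as follows. This is a simplified illustration, not the library's code: it assumes each exposure's header is available as a dict keyed by the lowercase names listed above, and the `group_exposures` helper and sample values are hypothetical.

```python
from collections import defaultdict

# Keywords whose combined values define one candidate L3 product group
GROUP_KEYS = ("prog", "obs", "optelem", "fxd_slit", "subarray")


def group_exposures(headers):
    """Group exposure names by the tuple of their grouping-keyword values."""
    groups = defaultdict(list)
    for name, hdr in headers.items():
        key = tuple(hdr.get(k) for k in GROUP_KEYS)
        groups[key].append(name)
    return dict(groups)


# Hypothetical header metadata keyed by L1 exposure name
base = {"prog": "01018", "optelem": "F200W", "fxd_slit": None, "subarray": "FULL"}
headers = {
    "exp1": {**base, "obs": "001"},
    "exp2": {**base, "obs": "001"},
    "exp3": {**base, "obs": "002"},
}
groups = group_exposures(headers)
```

Here `exp1` and `exp2` share every keyword value and land in one group, while `exp3` differs in `obs` and forms its own.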

property input_data

Preprocessed input data grouped by exposure type.

Returns:

input data grouped by exp_type (IMAGE, SPEC, FGS, TAC)

Return type:

dict

make_image_product_name(k, v, tnum)[source]

Parse through exposure metadata to create expected L3 image products.

  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • tnum (str) – number assigned to each unique target (targ_ra) within a program

make_spec_product_name(k, v, tnum)[source]

Parse through exposure metadata to create expected L3 spectroscopy products.

NOTE: Although the pipeline would create multiple products for either source-based exposures or (channel-based) MIRI MRS exposures, only one product name will be created since the model is concerned with RAM, i.e. how large the memory footprint is to calibrate a set of input exposures. Source-based products use “s00001” for the source; MIR_MRS exposures default to “ch4” for channel.

  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • tnum (str) – number assigned to each unique target (targ_ra) within a program

make_tac_product_name(k, v, p)[source]

If an image or spec product meets the required conditions, it is added instead to the TAC products dictionary (Time-series, AMI, Coronagraph).

  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • p (str) – product name


Generate the pixel offset between exposure reference pixels and the estimated L3 fiducial.


Scrape input exposure header metadata from fits files on local disk located at self.input_path.


Main calling function for preprocessing input exposures of a given exposure type.

  • exp_type (str, optional) – Exposure type, by default “IMAGE”


preprocessed data with renamed columns, NaNs scrubbed and categorical data encoded

Return type:



Certain L3 products need to be further defined by their L1 input TARG_RA values in addition to all other parameters. This only affects PRIME_TARGETED_FIXED visit types where TARGNAME != NaN. If multiple unique TARG_RA/DEC values (rounded to 6 digits) are identified within the group of exposures, we can assume each TARG grouping is a unique L3 product.
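
A rough sketch of the subdivision by rounded coordinates is shown below. It is an assumption-laden illustration rather than spacekit's actual logic: the `assign_target_numbers` helper, the sample coordinates, and the "t1", "t2" numbering format are all hypothetical.

```python
def assign_target_numbers(exposures, ndigits=6):
    """Map each exposure to a target number based on its rounded (RA, DEC) pair."""
    targets = {}   # rounded (ra, dec) -> target number, in order of first appearance
    assigned = {}  # exposure name -> target number
    for name, (ra, dec) in exposures.items():
        coord = (round(ra, ndigits), round(dec, ndigits))
        if coord not in targets:
            targets[coord] = f"t{len(targets) + 1}"  # hypothetical ID format
        assigned[name] = targets[coord]
    return assigned


# Hypothetical (TARG_RA, TARG_DEC) values for exposures in one group
exposures = {
    "exp1": (80.4875001, -69.4975002),
    "exp2": (80.4875003, -69.4975001),
    "exp3": (80.5120000, -69.4975001),
}
tnums = assign_target_numbers(exposures)
```

Rounding to 6 digits absorbs small pointing differences, so `exp1` and `exp2` count as one target while `exp3` is treated as a separate L3 product.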

class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)[source]

Base parent class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done via subclasses below.

save_csv_file(df=None, pfx='', index_col='index')[source]

Saves dataframe to csv file on local disk.


pfx (str, optional) – Insert a prefix at start of filename, by default “”


self.data_path where file is saved on disk.

Return type:
