spacekit.preprocessor.scrub

class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)
class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)

Class for invoking the standard preprocessing steps on Single Visit Mosaic (SVM) regression test data.

Parameters:
  • input_path (str or Path) – path to directory containing data input files

  • data (dataframe, optional) – dataframe containing raw inputs scraped from JSON (QA) files, by default None

  • output_path (str or Path, optional) – location to save preprocessed output files, by default None

  • output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”

  • dropnans (bool, optional) – find and remove any NaNs, by default True

  • save_raw (bool, optional) – save data as csv before any encoding is performed, by default True

  • make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True

  • crpt (int, optional) – flag indicating that the dataset contains synthetically corrupted data, by default 0

  • make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False

add_crpt_labels()

For new synthetic datasets, adds a “label” target column and assigns a value of 1 to every row.

Returns:

self.df updated with a label column (all values set to 1)

Return type:

dataframe
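The labeling step described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the standalone function, the `n_exposures` column, and the dataset names are all hypothetical, and unlike the documented method (which updates `self.df`), the sketch returns a labeled copy.

```python
import pandas as pd


def add_crpt_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a "label" column where every row is 1."""
    labeled = df.copy()
    labeled["label"] = 1
    return labeled


# Hypothetical rows standing in for synthetically corrupted datasets.
corrupted = pd.DataFrame(
    {"n_exposures": [2, 4, 3]},
    index=["dataset_a", "dataset_b", "dataset_c"],
)
labeled = add_crpt_labels(corrupted)
```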

find_subsamples()

Gets a varied sampling of dataframe observations and saves it to a local text file. This is one way of identifying a small subset for synthetic data generation.
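How the "varied sampling" is chosen is not specified here. As a stand-in, the sketch below draws a plain random sample of aligned (label=0) rows with `DataFrame.sample`, which may differ from what the method actually does. The function arguments `n` and `seed` and the dataset names are hypothetical, and writing the result to a text file is left out.

```python
import pandas as pd


def find_subsamples(df: pd.DataFrame, n: int = 2, seed: int = 0) -> list:
    """Pick up to n index names at random from rows with label == 0."""
    aligned = df.loc[df["label"] == 0]
    n = min(n, len(aligned))  # never ask for more rows than exist
    return aligned.sample(n=n, random_state=seed).index.tolist()


# Hypothetical labels: 0 = aligned, 1 = misaligned.
demo = pd.DataFrame({"label": [0, 1, 0, 0]}, index=["a", "b", "c", "d"])
picked = find_subsamples(demo, n=2)
```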

make_pos_label_list()

Looks for target class labels in the dataframe and saves a text file listing the index names of the positive class. This was originally intended to automate moving images into class-labeled directories.
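A rough pandas equivalent of this step is shown below. It assumes the target column is named "label" with 1 marking the positive class (as in the parameter descriptions above); the standalone function, the `pos.txt` file name, and the dataset names are hypothetical.

```python
import os
import tempfile

import pandas as pd


def make_pos_label_list(df: pd.DataFrame, output_path: str, fname: str = "pos.txt") -> str:
    """Write the index names of rows with label == 1, one per line."""
    positives = df.loc[df["label"] == 1].index.tolist()
    path = os.path.join(output_path, fname)
    with open(path, "w") as f:
        for name in positives:
            f.write(f"{name}\n")
    return path


# Hypothetical labels: 1 = misaligned (positive class).
demo = pd.DataFrame(
    {"label": [0, 1, 1, 0]},
    index=["dataset_a", "dataset_b", "dataset_c", "dataset_d"],
)
pos_path = make_pos_label_list(demo, tempfile.mkdtemp())
with open(pos_path) as f:
    pos_names = f.read().split()
```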

preprocess_data()

Main calling function to run each preprocessing step for SVM regression data.
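A possible way to call this entry point, using only the constructor arguments listed in the class signature above. The wrapper function `run_svm_scrubber` is hypothetical, and the sketch assumes `df` already holds the raw inputs scraped from the JSON QA files. What `preprocess_data()` returns is not documented here, so the sketch does not rely on a return value.

```python
def run_svm_scrubber(input_path, df, output_path):
    """Build an HstSvmScrubber and run its preprocessing steps on one batch."""
    # Imported lazily so this sketch can be loaded without spacekit installed.
    from spacekit.preprocessor.scrub import HstSvmScrubber

    scrubber = HstSvmScrubber(
        input_path,
        data=df,
        output_path=output_path,
        output_file="svm_data",
        dropnans=True,
        save_raw=True,
        make_pos_list=True,
        crpt=0,
        make_subsamples=False,
    )
    scrubber.preprocess_data()
    return scrubber
```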

scrub_columns()

Initial dataframe scrubbing to extract and rename columns, drop NaNs, and set the index.
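The four operations named above (extract, rename, drop NaNs, set index) map onto a short pandas chain. The sketch below is illustrative only: the standalone function, the raw column names, and the rename mapping are hypothetical and do not reflect the columns the scrubber actually uses.

```python
import pandas as pd


def scrub_columns(df, rename_map, index_col, dropnans=True):
    """Keep index_col plus the mapped columns, rename them, and set the index."""
    out = df[[index_col, *rename_map]].rename(columns=rename_map)
    if dropnans:
        out = out.dropna()  # drop any row with a missing value
    return out.set_index(index_col)


# Hypothetical raw inputs with one missing value and one unused column.
raw = pd.DataFrame(
    {
        "dataset": ["a", "b", "c"],
        "raw.ra": [10.0, float("nan"), 30.0],
        "raw.dec": [1.0, 2.0, 3.0],
        "unused": [0, 0, 0],
    }
)
clean = scrub_columns(raw, {"raw.ra": "ra", "raw.dec": "dec"}, index_col="dataset")
```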

scrub_qa_summary(csvfile='single_visit_mosaics*.csv', idx=0)

Alternative for when no .json files are available (for example, because the QA step was not run during processing).
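The default `csvfile` value is a glob pattern, so locating the summary file presumably involves pattern matching inside the input directory. The sketch below shows that idea with the standard `glob` module. The helper `load_qa_summary`, the demo file name, and its columns are hypothetical; the `idx` argument is not described here, so the sketch leaves it out.

```python
import glob
import os
import tempfile

import pandas as pd


def load_qa_summary(input_path, csvfile="single_visit_mosaics*.csv"):
    """Read the first CSV in input_path whose name matches the pattern."""
    matches = sorted(glob.glob(os.path.join(input_path, csvfile)))
    if not matches:
        raise FileNotFoundError(f"no file matching {csvfile!r} in {input_path}")
    return pd.read_csv(matches[0])


# Demo with a throwaway directory and a made-up summary file.
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "single_visit_mosaics_demo.csv")
pd.DataFrame({"dataset": ["a", "b"], "n_files": [3, 5]}).to_csv(demo_csv, index=False)
summary = load_qa_summary(demo_dir)
```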

class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, **log_kws)
class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)

Base class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done by its subclasses.

save_csv_file(df=None, pfx='', index_col='index')

Saves the dataframe to a CSV file on local disk.

Parameters:

pfx (str, optional) – prefix to insert at the start of the filename, by default “”

Returns:

self.data_path where file is saved on disk.

Return type:

str
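As a rough illustration of the prefix behavior, the sketch below builds the file name from `pfx` plus a base name, writes the CSV with pandas, and returns the path. It is a standalone function, not the method itself: `output_path` and `output_file` are passed explicitly instead of being read from the instance, using `index_col` as the header for the index column is an assumption, and the `raw_` prefix is hypothetical.

```python
import os
import tempfile

import pandas as pd


def save_csv_file(df, output_path, output_file, pfx="", index_col="index"):
    """Write df to <output_path>/<pfx><output_file> and return that path."""
    os.makedirs(output_path, exist_ok=True)
    data_path = os.path.join(output_path, f"{pfx}{output_file}")
    # Assumption: index_col names the index column in the CSV header.
    df.to_csv(data_path, index_label=index_col)
    return data_path


demo = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
saved_path = save_csv_file(demo, tempfile.mkdtemp(), "batch.csv", pfx="raw_")
restored = pd.read_csv(saved_path, index_col="index")
```

Reading the file back with `index_col="index"` restores the original index, which is a quick round-trip check that the prefix and header were written as intended.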