spacekit.preprocessor.scrub
- class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)[source]
- class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)[source]
Class for invoking the standard preprocessing steps for Single Visit Mosaic (SVM) regression test data.
- Parameters:
input_path (str or Path) – path to directory containing data input files
data (dataframe, optional) – dataframe containing raw inputs scraped from json (QA) files, by default None
output_path (str or Path, optional) – location to save preprocessed output files, by default None
output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”
dropnans (bool, optional) – find and remove any NaNs, by default True
save_raw (bool, optional) – save data as csv before any encoding is performed, by default True
make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True
crpt (int, optional) – flag indicating that the dataset contains synthetically corrupted data, by default 0
make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False
- add_crpt_labels()[source]
For new synthetic datasets, adds “label” target column and assigns value of 1 to all rows.
- Returns:
self.df updated with a label column (all values set to 1)
- Return type:
dataframe
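The labeling step described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the standalone function and the column names `numexp` and `rms_ra` are hypothetical stand-ins, and only the "label" column name and the value 1 come from the description.

```python
import pandas as pd


def add_crpt_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'label' column set to 1 for every row."""
    labeled = df.copy()
    labeled["label"] = 1  # every synthetic (corrupted) row is a positive example
    return labeled


# Hypothetical feature columns and index names, for illustration only.
df = pd.DataFrame(
    {"numexp": [2, 4], "rms_ra": [0.1, 0.3]},
    index=["visit_a", "visit_b"],
)
labeled = add_crpt_labels(df)
```

The sketch copies the dataframe so the input stays untouched; the documented method instead updates self.df.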
- find_subsamples()[source]
Gets a varied sampling of dataframe observations and saves it to a local text file. This is one way of identifying a small subset for synthetic data generation.
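One possible way to draw such a sample is sketched below with plain pandas. How the library decides what counts as "varied" is not stated here, so the grouping column `det`, the per-group sample size, and the function name `save_subsamples` are all assumptions made for illustration. The sketch samples only aligned (label=0) rows, matching the make_subsamples parameter description.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_subsamples(df, group_col, n_per_group, out_file, seed=0):
    """Sample up to n_per_group aligned rows per group; write index names to a text file."""
    aligned = df[df["label"] == 0]
    picked = []
    for _, group in aligned.groupby(group_col):
        n = min(n_per_group, len(group))
        sample = group.sample(n=n, random_state=seed)
        picked.extend(str(name) for name in sample.index)
    Path(out_file).write_text("\n".join(picked) + "\n")
    return picked


# Hypothetical data: 'det' is an assumed grouping column, not a documented one.
df = pd.DataFrame(
    {"det": [0, 0, 0, 1, 1], "label": [0, 0, 1, 0, 0]},
    index=["a1", "a2", "a3", "b1", "b2"],
)
out_file = Path(tempfile.mkdtemp()) / "subset.txt"
picked = save_subsamples(df, "det", 1, out_file)
```

Sampling within groups, rather than across the whole dataframe, is one simple way to keep the subset from being dominated by the most common group.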
- make_pos_label_list()[source]
Looks for target class labels in the dataframe and saves a text file listing the index names of the positive class. Originally this was to automate moving images into class-labeled directories.
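The behavior described above amounts to filtering on the label column and writing the matching index names to disk. The following is a hedged sketch in plain pandas, assuming the target column is named "label" with 1 marking the positive (misaligned) class; the function name `save_pos_list` and the output file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_pos_list(df, out_file, target="label"):
    """Write index names of rows where target == 1, one per line; return them."""
    if target not in df.columns:
        return []  # nothing to do when the dataframe has no labels
    pos = [str(name) for name in df.index[df[target] == 1]]
    Path(out_file).write_text("\n".join(pos) + "\n")
    return pos


# Hypothetical dataset names used as the dataframe index.
df = pd.DataFrame({"label": [0, 1, 1, 0]}, index=["v1", "v2", "v3", "v4"])
pos_file = Path(tempfile.mkdtemp()) / "pos.txt"
pos = save_pos_list(df, pos_file)
```

A file written this way can then drive a simple loop that moves each listed dataset's images into a directory for the positive class.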
- preprocess_data()[source]
Main calling function to run each preprocessing step for SVM regression data.
- class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, **log_kws)[source]
- class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)[source]
Base parent class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done by its subclasses.
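To give a sense of what basic column scrubbing on a pandas dataframe can look like, here is a minimal sketch. It assumes that dropnans means dropping any row containing a NaN and that col_order selects and orders the output columns; neither behavior is confirmed by the text above, and `scrub_columns` is a hypothetical helper rather than a method of the class.

```python
import pandas as pd


def scrub_columns(df, col_order=None, dropnans=True):
    """Optionally drop rows containing NaNs, then select and order columns."""
    out = df.copy()
    if dropnans:
        out = out.dropna(axis=0, how="any")  # assumed meaning of dropnans
    if col_order is not None:
        # Keep only the requested columns that exist, in the requested order.
        out = out[[col for col in col_order if col in out.columns]]
    return out


# Hypothetical raw inputs with one missing value.
raw = pd.DataFrame(
    {"b": [1.0, None, 3.0], "a": [4.0, 5.0, 6.0], "extra": [7, 8, 9]},
    index=["x", "y", "z"],
)
clean = scrub_columns(raw, col_order=["a", "b"])
```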