spacekit.preprocessor.transform
- class spacekit.preprocessor.transform.SkyTransformer(mission, name='SkyTransformer', **log_kws)[source]
Bases: object
- Parameters:
- calculate_offsets(product_exp_headers)[source]
Given key-value pairs of header info from a set of input exposures, estimate the fiducial (center pixel coordinates) of the final image product and calculate pixel offset statistics between the inputs and the final output using detector-based footprints and sky separation angles.
NOTE: the product keys and input exposure keys can be any strings and are used simply for organization. The FITS-related key-value pairs nested within each input exposure dictionary must contain, at minimum, the instrument and the fiducial RA/Dec coordinates (e.g. “INSTRUME”, “CRVAL1”, “CRVAL2”). The keys themselves can be customized using self.set_keys(**kwargs) but must match the contents of the nested dictionary passed into product_exp_headers. Typically these are derived directly from the FITS file SCI headers of the input exposures. Some missions and instruments require additional information such as “CHANNEL” (JWST NIRCam) or “DETECTOR” (HST) in order to identify the correct pixel scale and footprint size based on the detector and/or wavelength channel.
- Parameters:
product_exp_headers (dict) – nested dictionary of (typically Level 3) product names (keys), their input exposures (values) and relevant fits header information per exposure (key-value pairs).
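The nested layout described above can be sketched as follows. The product and exposure names and the coordinate values are made up for illustration, and `sky_separation_deg` is a hypothetical helper (a plain haversine formula), not the method spacekit uses internally.

```python
import math

# Hypothetical product and exposure names; the inner keys follow the documented defaults.
product_exp_headers = {
    "product_a": {
        "exposure_1": {"INSTRUME": "NIRCAM", "CHANNEL": "SHORT", "CRVAL1": 80.0, "CRVAL2": -69.5},
        "exposure_2": {"INSTRUME": "NIRCAM", "CHANNEL": "SHORT", "CRVAL1": 80.0, "CRVAL2": -69.4},
    },
}


def sky_separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


# Separation between the fiducials of the two input exposures of one product.
exposures = product_exp_headers["product_a"]
first, second = exposures["exposure_1"], exposures["exposure_2"]
separation = sky_separation_deg(
    first["CRVAL1"], first["CRVAL2"], second["CRVAL1"], second["CRVAL2"]
)
```

Here the two exposures share the same right ascension, so the separation is simply the 0.1 degree difference in declination.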
- set_keys(**kwargs)[source]
Set the keys used in the exposure header dictionary to identify values (typically derived from FITS file SCI headers). Possible keyword arguments include: instr, detector, channel, ra, dec, where ‘ra’ and ‘dec’ refer to the fiducial (center pixel coordinates in degrees). None values will use defaults (see below); unrecognized kwargs will be ignored.
Defaults:
* instr=“INSTRUME”
* detector=“DETECTOR”
* channel=“CHANNEL”
* band=“BAND”
* exp_type=“EXP_TYPE”
* ra=“CRVAL1” (could also use “RA_REF”)
* dec=“CRVAL2” (could also use “DEC_REF”)
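The override behavior described for set_keys (None falls back to the default, unrecognized names are dropped) can be sketched like this. `DEFAULT_KEYS` and `resolve_keys` are hypothetical names used only for illustration; they are not part of spacekit.

```python
# Documented defaults, collected in one hypothetical dictionary.
DEFAULT_KEYS = {
    "instr": "INSTRUME",
    "detector": "DETECTOR",
    "channel": "CHANNEL",
    "band": "BAND",
    "exp_type": "EXP_TYPE",
    "ra": "CRVAL1",
    "dec": "CRVAL2",
}


def resolve_keys(**kwargs):
    """Return the header keys to use, starting from the defaults."""
    keys = dict(DEFAULT_KEYS)
    for name, value in kwargs.items():
        # Skip unrecognized names and None values so the defaults survive.
        if name in keys and value is not None:
            keys[name] = value
    return keys


keys = resolve_keys(ra="RA_REF", dec=None, unknown="IGNORED")
```

In this sketch only `ra` changes: `dec` keeps its default because None was passed, and `unknown` is discarded.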
- class spacekit.preprocessor.transform.Transformer(data, cols=None, ncols=None, tx_data=None, tx_file=None, save_tx=True, join_data=1, rename='_scl', output_path=None, name='Transformer', **log_kws)[source]
Bases: object
Initializes a Transformer class object. Unless the cols attribute is empty, it will automatically instantiate some of the other attributes needed to transform the data. Using the Transformer subclasses instead is recommended (this class is mainly used as an object with general methods to load or save the transform data as well as instantiate some of the initial attributes).
- Parameters:
data (dataframe or numpy.ndarray) – input data containing continuous feature vectors to be transformed (may also contain vectors or columns of categorical and other datatypes as well).
transformer (class, optional) – transform class to use (e.g. from scikit-learn), by default PowerTransformer(standardize=False)
cols (list, optional) – column names or array index values of feature vectors to be transformed (i.e. continuous datatype features), by default []
tx_file (string, optional) – path to saved transformer metadata, by default None
save_tx (bool, optional) – save the transformer metadata as json file on local disk, by default True
join_data (int, optional) – 1: join normalized data with remaining columns of original; 2: join with complete original, all columns (requires renaming)
rename (str or list) – if string, will be appended to normalized col names; if list, will rename normalized columns in this order
output_path (string, optional) – where to save the transformer metadata, by default None (current working directory)
- categorical_data()[source]
Stores the other feature vectors in a separate variable (any leftover from data that are not in cols).
- Returns:
“categorical” i.e. non-continuous feature vectors (as determined by the cols attribute)
- Return type:
dataframe or ndarray
- continuous_data()[source]
Stores continuous feature vectors in a variable using the column names (or axis index if using numpy arrays) from the cols attribute.
- Returns:
continuous feature vectors (as determined by the cols attribute)
- Return type:
dataframe or ndarray
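The split performed by continuous_data and categorical_data can be pictured with a small pure-Python sketch over a row-major 2D list. `split_columns` is a hypothetical helper for illustration only; the library itself works on dataframes and numpy arrays.

```python
def split_columns(rows, cols):
    """Split a row-major 2D list into (continuous, other) parts by column index."""
    n_cols = len(rows[0])
    # Every column index not listed in cols is treated as non-continuous.
    other = [i for i in range(n_cols) if i not in cols]
    continuous = [[row[i] for i in cols] for row in rows]
    categorical = [[row[i] for i in other] for row in rows]
    return continuous, categorical


rows = [
    [143.0, 235.0, 10.4, 79.0, 0, 1, 0],
    [109.0, 262.0, 15.9, 63.0, 1, 0, 1],
]
continuous, categorical = split_columns(rows, cols=[0, 1, 2, 3])
```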
- load_transformer_data(tx=None)[source]
Loads saved transformer metadata from a dictionary or a json file on local disk.
- Returns:
transform metadata used for applying transformations on new data inputs
- Return type:
dictionary
- normalizeX(normalized)[source]
Combines original non-continuous features/vectors with the transformed/normalized data. Determines datatype (array or dataframe) and calls the appropriate method.
- Parameters:
- Returns:
array or dataframe of same shape and datatype as inputs, with continuous vectors/features normalized
- Return type:
ndarray or dataframe
- normalized_dataframe(normalized)[source]
Creates a new dataframe with the normalized data. Optionally combines with non-continuous vectors (original data) and appends _scl to the original column names for the ones that have been transformed.
- Parameters:
- Returns:
dataframe of same shape as input data with continuous features normalized
- Return type:
dataframe
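The renaming rule (a string is appended as a suffix, a list replaces the names in order) can be sketched as below. `rename_columns` is a hypothetical helper, and the length check is an assumption added for safety rather than documented behavior.

```python
def rename_columns(cols, rename="_scl"):
    """Return new names for the transformed columns."""
    if isinstance(rename, str):
        # String: append it as a suffix to each original name.
        return [f"{col}{rename}" for col in cols]
    # List: use the given names in order (assumed to match cols one-to-one).
    if len(rename) != len(cols):
        raise ValueError("rename list must have one name per transformed column")
    return list(rename)


suffixed = rename_columns(["numexp", "rms_ra"])
explicit = rename_columns(["numexp", "rms_ra"], rename=["n_exp_norm", "rms_ra_norm"])
```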
- normalized_matrix(normalized)[source]
Concatenates arrays of normalized data with original non-continuous data along the column axis (axis=1).
- Parameters:
normalized (numpy.ndarray) – normalized data
- Returns:
array of same shape as input data, with continuous vectors normalized
- Return type:
ndarray
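Concatenating along axis=1 means each row keeps its position and simply gains the extra columns. A minimal pure-Python sketch of that idea, using a hypothetical `join_columns` helper in place of the numpy call:

```python
def join_columns(normalized, categorical):
    """Append the non-continuous columns to each normalized row (axis=1 join)."""
    if len(normalized) != len(categorical):
        raise ValueError("both inputs must have the same number of rows")
    return [list(n_row) + list(c_row) for n_row, c_row in zip(normalized, categorical)]


normalized = [[0.5, -0.5], [-0.5, 0.5]]
categorical = [[0, 1], [1, 0]]
joined = join_columns(normalized, categorical)
```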
- save_transformer_data(tx=None, fname='tx_data.json')[source]
Save the transform metadata to a json file on local disk. Typical use-case is when you need to transform new inputs prior to generating a prediction but don’t have access to the original dataset used to train the model.
- Parameters:
tx (dictionary) – statistical metadata calculated when applying a transform to the training dataset; for PowerTransform this consists of lambdas, means and standard deviations for each continuous feature vector of the dataset.
- Returns:
path where json file is saved on disk
- Return type:
string
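A round trip of the transform metadata through a JSON file might look like the sketch below. The dictionary keys (`lambdas`, `mu`, `sigma`) and the helper names are assumptions for illustration; the actual layout of spacekit's saved file may differ.

```python
import json
import os
import tempfile


def save_tx(tx, output_path, fname="tx_data.json"):
    """Write the metadata dictionary to disk and return the file path."""
    os.makedirs(output_path, exist_ok=True)
    path = os.path.join(output_path, fname)
    with open(path, "w") as f:
        json.dump(tx, f)
    return path


def load_tx(path):
    """Read the metadata dictionary back from disk."""
    with open(path) as f:
        return json.load(f)


# Hypothetical metadata for two continuous feature vectors.
tx = {"lambdas": [0.5, 1.0], "mu": [12.3, 4.5], "sigma": [2.0, 0.7]}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_tx(tx, tmp_dir)
    restored = load_tx(saved_path)
```

Storing plain lists rather than arrays keeps the file readable by the standard json module without a custom encoder.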
- class spacekit.preprocessor.transform.PowerX(data, cols, ncols=None, tx_data=None, tx_file=None, save_tx=False, save_as='tx_data.json', output_path=None, join_data=1, rename='_scl', **log_kws)[source]
Bases: Transformer
Applies Yeo-Johnson PowerTransform (via scikit-learn) normalization and scaling to continuous feature vectors of a dataframe or numpy array. The tx_data attribute can be instantiated from a json file, dictionary or the input data itself. The training and test sets should be normalized separately (i.e. distinct class objects) to prevent data leakage when training a machine learning model. Loading the transform metadata from a json file allows you to transform a new input array (e.g. for predictions) without needing to access the original dataframe.
- Parameters:
Transformer (class) – spacekit.preprocessor.transform.Transformer parent class
- Returns:
spacekit.preprocessor.transform.PowerX power transform subclass
- Return type:
PowerX class object
- apply_power_matrix()[source]
Transforms the input data. This method assumes we already have tx_data and a fit-transformed input_matrix (array of continuous feature vectors), which is normally done automatically when the class object is instantiated and calculate_power is called.
- Returns:
power transformed continuous feature vectors
- Return type:
ndarray
- calculate_power()[source]
Fits and transforms the continuous feature vectors using scikit-learn PowerTransform. Calculates the mean and standard deviation of each transformed vector as a separate step (for scaling to zero mean and unit variance) and stores these along with the lambdas in the tx_data dictionary attribute. This is so that the same normalization can be applied later for prediction inputs without requiring the original training data; otherwise it would be the same as using PowerTransform(standardize=True). Optionally, the calculated transform data can be stored in a json file on local disk.
- Returns:
spacekit.preprocessor.transform.PowerX object with transformation metadata calculated for the input data and stored as attributes.
- Return type:
self
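The two-step idea (power transform first, then a separately stored mean and standard deviation) can be sketched in pure Python. This is only an illustration under stated assumptions: the lambda is supplied by hand here, whereas scikit-learn estimates it from the data, and the dictionary keys and function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -(((1 - x) ** (2 - lam) - 1) / (2 - lam))


def fit_stats(values, lam):
    """Transform one feature vector and record what is needed to repeat the scaling."""
    transformed = [yeo_johnson(v, lam) for v in values]
    # Population standard deviation (an assumption, matching numpy's default).
    return {"lambda": lam, "mu": mean(transformed), "sigma": pstdev(transformed)}


def apply_stats(values, tx):
    """Apply a stored transform to new values without the original training data."""
    return [(yeo_johnson(v, tx["lambda"]) - tx["mu"]) / tx["sigma"] for v in values]


train = [143.0, 109.0, 120.0, 98.0]
tx = fit_stats(train, lam=0.5)   # hand-picked lambda for illustration
scaled_train = apply_stats(train, tx)
scaled_new = apply_stats([130.0], tx)
```

Because `tx` holds everything the second step needs, it can be serialized and reused later on prediction inputs.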
- fitX()[source]
Instantiates a scikit-learn PowerTransformer object and fits it to the input data. If tx_data was passed as a kwarg or loaded from tx_file, the lambdas attribute for the transformer object will be updated to use these instead of those calculated at the transform step.
- Returns:
transformer fit to the data
- Return type:
PowerTransformer object
- get_lambdas()[source]
Instantiates the lambdas from file or dictionary if passed as kwargs; otherwise it uses the lambdas calculated in the transformX method. If transformX has not been called yet, returns None.
- Returns:
transform of multiple feature vectors returns an array of lambda values; otherwise a single vector returns a single (float) value.
- Return type:
ndarray or float
PowerX Examples
Calculate the normalization parameters of a dataframe (“training set”) using the Yeo-Johnson PowerTransform and save the params to a json file on local disk. Use this metadata (PowerTransformer.lambdas_, mean, and standard deviation for each continuous feature vector) to transform new inputs (“test set”) in A) the same session or B) a separate session.
Example 1A: Normalize a Dataframe, Apply to Another Dataframe Separately
Px = PowerX(df, cols=["numexp", "rms_ra", "rms_dec", "nmatches", "point", "segment", "gaia"], save_tx=True)
dfX = PowerX(df2, cols=Px.cols, tx_data=Px.tx_data).Xt
Example 1B: Load saved transform data from json file, apply to new data (separate session)
tx_file = "data/tx_data"
Px = PowerX(df2, cols=["numexp", "rms_ra", "rms_dec"], tx_file=tx_file)
dfX = Px.Xt
Example 2: Normalize 2D numpy array (exclude specific axes)
import numpy as np
from spacekit.preprocessor.transform import PowerX
# the last 3 columns are encoded categorical features so we exclude these columns
X = np.asarray([[143., 235., 10.4, 79., 0, 1, 0], [109., 262., 15.9, 63., 1, 0, 1]])
Px = PowerX(X, cols=[0, 1, 2, 3])
Xt = Px.Xt
- spacekit.preprocessor.transform.normalize_training_data(df, cols, X_train, X_test, X_val=None, rename=None, output_path=None)[source]
Apply Yeo-Johnson PowerTransform (via scikit-learn) normalization and scaling to the training data, saving the transform metadata to a json file on local disk and transforming the train, test and val sets separately (to prevent data leakage).
- Parameters:
df (pandas dataframe) – training dataset
cols (list) – column names or array index values of feature vectors to be transformed (i.e. continuous datatype features)
X_train (ndarray) – training set feature inputs array
X_test (ndarray) – test set feature inputs array
X_val (ndarray, optional) – validation set inputs array, by default None
- Returns:
normalized and scaled training, test, and validation sets
- Return type:
ndarrays
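One way to picture transforming each split separately with shared metadata is to compute the scaling statistics once from a reference set and reapply them to every split. The sketch below does this with plain standardization only (no power transform), and all function names are hypothetical; it is not the library's implementation.

```python
from statistics import mean, pstdev


def fit_scaler(reference):
    """Column-wise mean and population standard deviation of a row-major 2D list."""
    columns = list(zip(*reference))
    return {"mu": [mean(c) for c in columns], "sigma": [pstdev(c) for c in columns]}


def apply_scaler(rows, tx):
    """Standardize each row using the stored statistics."""
    return [[(v - m) / s for v, m, s in zip(row, tx["mu"], tx["sigma"])] for row in rows]


def scale_splits(reference, X_train, X_test, X_val=None):
    """Scale each split independently with statistics from the reference set only."""
    tx = fit_scaler(reference)
    scaled_val = apply_scaler(X_val, tx) if X_val is not None else None
    return apply_scaler(X_train, tx), apply_scaler(X_test, tx), scaled_val


reference = [[1.0, 10.0], [3.0, 30.0]]
train, test, val = scale_splits(reference, reference, [[2.0, 20.0]])
```

The test rows never contribute to `mu` or `sigma`, which is the property that guards against leakage.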
- spacekit.preprocessor.transform.normalize_training_images(X_tr, X_ts, X_vl=None)[source]
Scale image inputs so that all pixel values are converted to a decimal between 0 and 1 (divide by 255).
- Parameters:
X_tr (ndarray) – training set images
X_ts (ndarray) – test set images
X_vl (ndarray, optional) – validation set images, by default None
- Returns:
image set arrays
- Return type:
ndarrays
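Dividing 8-bit pixel values by 255 maps the range 0 to 255 onto 0 to 1. A minimal sketch with nested lists standing in for image arrays; the helper names are hypothetical, and returning None for a missing validation set is an assumption made for this example.

```python
def scale_pixels(images, max_value=255.0):
    """Divide every pixel of every 2D image in a set by max_value."""
    return [[[px / max_value for px in row] for row in img] for img in images]


def scale_image_sets(X_tr, X_ts, X_vl=None):
    """Scale the training and test sets, and the validation set when provided."""
    scaled_vl = scale_pixels(X_vl) if X_vl is not None else None
    return scale_pixels(X_tr), scale_pixels(X_ts), scaled_vl


X_tr = [[[0, 255], [51, 102]]]      # one 2x2 training image
X_ts = [[[255, 255], [0, 0]]]       # one 2x2 test image
train, test, val = scale_image_sets(X_tr, X_ts)
```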
- spacekit.preprocessor.transform.arrays_to_tensors(X_train, y_train, X_test, y_test, reshape_y=False)[source]
Converts multiple numpy arrays into tensorflow tensor datatypes at once (for convenience).
- Parameters:
X_train (ndarray) – input training features
y_train (ndarray) – training target values
X_test (ndarray) – input test features
y_test (ndarray) – test target values
- Returns:
X_train, y_train, X_test, y_test
- Return type:
tensorflow.tensors
- spacekit.preprocessor.transform.tensor_to_array(tensor, reshape=False, shape=(-1, 1))[source]
Convert a tensor back into a numpy array. Optionally reshape the array (e.g. for target class data).
- Parameters:
tensor (tensor) – tensorflow tensor object
reshape (bool, optional) – reshapes the array (-1, 1) using numpy, by default False
- Returns:
array of same shape as input tensor, unless reshape=True
- Return type:
ndarray
- spacekit.preprocessor.transform.tensors_to_arrays(X_train, y_train, X_test, y_test)[source]
Converts tensors into arrays, which is necessary for certain regression analysis computations. The y_train and y_test args are reshaped using numpy.reshape(-1, 1).
- Parameters:
X_train (tensor) – training feature inputs
y_train (tensor) – training target outputs
X_test (tensor) – test feature inputs
y_test (tensor) – test target outputs
- Returns:
X_train, y_train, X_test, y_test
- Return type:
numpy.ndarrays
- spacekit.preprocessor.transform.hypersonic_pliers(path_to_train, path_to_test, y_col=[0], skip=1, dlm=',', subtract_y=0.0)[source]
Extracts data into 1-dimensional arrays, using separate target classes (y) for training and test data. Assumes y (target) is the first column in the dataframe. If the target (y) classes in the raw data are 1 and 2, but you’d like them to be binaries (0 and 1), set subtract_y=1.0
- Parameters:
path_to_train (string) – path to training data file (csv)
path_to_test (string) – path to test data file (csv)
y_col (list, optional) – axis index of target class, by default [0]
skip (int, optional) – skiprows parameter for np.loadtxt, by default 1
dlm (str, optional) – delimiter, by default “,”
subtract_y (float, optional) – subtract this value from all y-values, by default 0.0
- Returns:
X_train, X_test, y_train, y_test
- Return type:
np.ndarrays
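The column split and label shift can be illustrated with the standard csv module on an in-memory string. `split_target` and the sample column names are hypothetical, and this sketch takes a single integer index for the target column, whereas the library reads from file paths with numpy and takes a list.

```python
import csv
import io


def split_target(text, y_col=0, skip=1, dlm=",", subtract_y=0.0):
    """Parse delimited text into features X and shifted targets y."""
    reader = csv.reader(io.StringIO(text), delimiter=dlm)
    # Skip header rows, ignore blank lines, and convert every field to float.
    rows = [[float(v) for v in row] for i, row in enumerate(reader) if i >= skip and row]
    y = [row[y_col] - subtract_y for row in rows]
    X = [[v for j, v in enumerate(row) if j != y_col] for row in rows]
    return X, y


raw = "LABEL,FLUX.1,FLUX.2\n2,93.85,83.81\n1,-38.88,-33.83\n"
X, y = split_target(raw, subtract_y=1.0)
```

With labels 1 and 2 in the raw data, subtracting 1.0 yields the binary targets 0.0 and 1.0.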
- spacekit.preprocessor.transform.babel_fish_dispenser(matrix1, matrix2=None, step_size=None, axis=2)[source]
Adds an input corresponding to the running average over a set number of time steps. This helps the neural network to ignore high frequency noise by passing in a uniform 1-D filter and stacking the arrays.
- Parameters:
- Returns:
2D array (original input array with a uniform 1d-filter as noise)
- Return type:
numpy array(s)
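The running-average idea can be sketched as a uniform (boxcar) filter whose output is paired with the raw signal at every time step. The helper names are hypothetical, an odd `step_size` is assumed, and the window is simply truncated at the edges, which may differ from the edge handling of the filter the library uses.

```python
def running_average(signal, step_size):
    """Mean over a centered window of step_size samples (odd step_size assumed)."""
    half = step_size // 2
    smoothed = []
    for i in range(len(signal)):
        # The window is cut short at the start and end of the signal.
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def stack_with_average(matrix, step_size=3):
    """Pair each sample with its smoothed value: result[row][t] == [raw, smoothed]."""
    return [
        [[raw, avg] for raw, avg in zip(row, running_average(row, step_size))]
        for row in matrix
    ]


stacked = stack_with_average([[0.0, 3.0, 0.0]], step_size=3)
```

The result has one extra trailing dimension of size 2, so a network sees both the raw and the smoothed value at each time step.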
- spacekit.preprocessor.transform.fast_fourier(matrix, bins)[source]
Takes an array (e.g. signal input values) and rotates number of bins to the left as a fast Fourier transform. Returns vector of length equal to matrix input array.
- Parameters:
matrix (ndarray) – input values to transform
bins (int) – number of rotations
- Returns:
vector of length equal to matrix input array
- Return type:
ndarray
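Taking only the rotation part of the description literally, a left rotation by `bins` positions preserves the vector length, as the sketch below shows. `rotate_left` is a hypothetical helper; it illustrates the length-preserving rotation alone and does not compute a Fourier transform or reproduce the library function.

```python
def rotate_left(values, bins):
    """Rotate a sequence to the left by bins positions, keeping its length."""
    if not values:
        return []
    shift = bins % len(values)   # rotations larger than the length wrap around
    return list(values[shift:]) + list(values[:shift])


rotated = rotate_left([1, 2, 3, 4, 5], bins=2)
```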