spacekit.preprocessor.transform

Inheritance diagram of spacekit.preprocessor.transform

class spacekit.preprocessor.transform.SkyTransformer(mission, name='SkyTransformer', **log_kws)[source]

Bases: object

Estimates the fiducial (center pixel coordinates) of final image products and computes pixel offset statistics between input exposures and the final output, using detector-based footprints and sky separation angles.

Parameters:
  • mission (str) – Name of mission or observatory, e.g. “JWST”, “HST”

  • product_exp_headers (dict, optional) – nested dictionary of product names, their input exposures, and relevant fits header information per exposure, by default None

  • name (str, optional) – logging name, by default “SkyTransformer”

calculate_offsets(product_exp_headers)[source]

Given key-value pairs of header info from a set of input exposures, estimate the fiducial (center pixel coordinates) of the final image product and calculate pixel offset statistics between the inputs and the final output using detector-based footprints and sky separation angles.

NOTE: the product keys and input exposure keys can be any strings and are used simply for organization. The fits-related key-value pairs nested within each input exposure dictionary must contain, at minimum, the instrument and fiducial ra/dec coordinates (e.g. “INSTRUME”, “CRVAL1”, “CRVAL2”). The keys themselves can be customized using self.set_keys(**kwargs) but must match the contents of the nested dictionary passed into product_exp_headers (see the sketch below). Typically these are derived directly from the fits file sci headers of the input exposures.

Some missions and instruments require additional information such as “CHANNEL” (JWST NIRCam) or “DETECTOR” (HST) in order to identify the correct pixel scale and footprint size based on the detector and/or wavelength channel.

Parameters:

product_exp_headers (dict) – nested dictionary of (typically Level 3) product names (keys), their input exposures (values) and relevant fits header information per exposure (key-value pairs).
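
A minimal sketch of the expected input structure and call. The product and exposure names here are hypothetical (any strings work, per the note above), the header keys follow the defaults listed under set_keys, and the coordinate values are illustrative:

from spacekit.preprocessor.transform import SkyTransformer

# nested dict: product name -> input exposures -> fits header key-value pairs
product_exp_headers = {
    "my_level3_product": {
        "exposure_1": {"INSTRUME": "NIRCAM", "CHANNEL": "SHORT", "CRVAL1": 150.1163, "CRVAL2": 2.2008},
        "exposure_2": {"INSTRUME": "NIRCAM", "CHANNEL": "SHORT", "CRVAL1": 150.1171, "CRVAL2": 2.2015},
    }
}
sky = SkyTransformer("JWST")
offsets = sky.calculate_offsets(product_exp_headers)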

data_shapes(instr)[source]
static estimate_fiducial(footprints: list)[source]
static footprint_from_shape(fiducial, scale, shape)[source]
get_pixel_offsets(exp_data)[source]
get_scale(instr, channel=None, detector=None, exp_type=None)[source]
image_pixel_scales()[source]
static offset_statistics(offsets, pfx='')[source]
static pixel_sky_separation(ra, dec, p_coords, scale, unit='deg')[source]
set_keys(**kwargs)[source]

Set keys used in the exposure header dictionary to identify values (typically derived from fits file sci headers). Possible keyword arguments include: instr, detector, channel, ra, dec, where ‘ra’, ‘dec’ refer to the fiducial (center pixel coordinate in degrees). None values will use the defaults below; unrecognized kwargs will be ignored.

Defaults:
  • instr=”INSTRUME”
  • detector=”DETECTOR”
  • channel=”CHANNEL”
  • band=”BAND”
  • exp_type=”EXP_TYPE”
  • ra=”CRVAL1” (alternatively “RA_REF”)
  • dec=”CRVAL2” (alternatively “DEC_REF”)
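
For example, to read the fiducial coordinates from reference-point keywords instead of the CRVAL defaults (a short sketch using the kwargs listed above):

sky = SkyTransformer("JWST")
sky.set_keys(ra="RA_REF", dec="DEC_REF")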

validate_fiducial(fiducial, exp)[source]
class spacekit.preprocessor.transform.Transformer(data, cols=None, ncols=None, tx_data=None, tx_file=None, save_tx=True, join_data=1, rename='_scl', output_path=None, name='Transformer', **log_kws)[source]

Bases: object

Initializes a Transformer class object. Unless the cols attribute is empty, it will automatically instantiate some of the other attributes needed to transform the data. Using the Transformer subclasses instead is recommended (this class is mainly used as an object with general methods to load or save the transform data as well as instantiate some of the initial attributes).

Parameters:
  • data (dataframe or numpy.ndarray) – input data containing continuous feature vectors to be transformed (may also contain vectors or columns of categorical and other datatypes as well).

  • transformer (class, optional) – transform class to use (e.g. from scikit-learn), by default PowerTransformer(standardize=False)

  • cols (list, optional) – column names or array index values of feature vectors to be transformed (i.e. continuous datatype features), by default []

  • tx_file (string, optional) – path to saved transformer metadata, by default None

  • save_tx (bool, optional) – save the transformer metadata as json file on local disk, by default True

  • join_data (int, optional) – 1: join normalized data with remaining columns of original; 2: join with complete original, all columns (requires renaming); by default 1

  • rename (str or list) – if string, will be appended to normalized col names; if list, will rename normalized columns in this order

  • output_path (string, optional) – where to save the transformer metadata, by default None (current working directory)

categorical_data()[source]

Stores the remaining feature vectors in a separate variable (any columns of data that are not in cols).

Returns:

“categorical” i.e. non-continuous feature vectors (as determined by cols attribute)

Return type:

dataframe or ndarray

check_columns(ncols=None)[source]
check_shape(data)[source]
continuous_data()[source]

Store continuous feature vectors in a variable using the column names (or axis index if using numpy arrays) from cols attribute.

Returns:

continuous feature vectors (as determined by cols attribute)

Return type:

dataframe or ndarray

load_transformer_data(tx=None)[source]

Loads saved transformer metadata from a dictionary or a json file on local disk.

Returns:

transform metadata used for applying transformations on new data inputs

Return type:

dictionary

normalizeX(normalized)[source]

Combines original non-continuous features/vectors with the transformed/normalized data. Determines datatype (array or dataframe) and calls the appropriate method.

Parameters:
  • normalized (dataframe or ndarray) – normalized data

  • join_data (bool, optional) – merge back with non-continuous data, by default True

  • rename (bool, optional) – append ‘_scl’ to normalized column names, by default True

Returns:

array or dataframe of same shape and datatype as inputs, with continuous vectors/features normalized

Return type:

ndarray or dataframe

normalized_dataframe(normalized)[source]

Creates a new dataframe with the normalized data. Optionally combines with non-continuous vectors (original data) and appends _scl to the original column names for the ones that have been transformed.

Parameters:
  • normalized (dataframe) – normalized feature vectors

  • join_data (bool, optional) – merge back with the original non-continuous data, by default True

  • rename (bool, optional) – append ‘_scl’ to normalized column names, by default True

Returns:

dataframe of same shape as input data with continuous features normalized

Return type:

dataframe

normalized_matrix(normalized)[source]

Concatenates arrays of normalized data with the original non-continuous data along the column axis (axis=1).

Parameters:

normalized (numpy.ndarray) – normalized data

Returns:

array of same shape as input data, with continuous vectors normalized

Return type:

numpy.ndarray

save_transformer_data(tx=None, fname='tx_data.json')[source]

Save the transform metadata to a json file on local disk. Typical use-case is when you need to transform new inputs prior to generating a prediction but don’t have access to the original dataset used to train the model.

Parameters:

tx (dictionary) – statistical metadata calculated when applying a transform to the training dataset; for PowerTransform this consists of lambdas, means and standard deviations for each continuous feature vector of the dataset.

Returns:

path where json file is saved on disk

Return type:

string

class spacekit.preprocessor.transform.PowerX(data, cols, ncols=None, tx_data=None, tx_file=None, save_tx=False, save_as='tx_data.json', output_path=None, join_data=1, rename='_scl', **log_kws)[source]

Bases: Transformer

Applies the Yeo-Johnson power transform (via scikit-learn’s PowerTransformer) normalization and scaling to continuous feature vectors of a dataframe or numpy array. The tx_data attribute can be instantiated from a json file, dictionary or the input data itself. The training and test sets should be normalized separately (i.e. as distinct class objects) to prevent data leakage when training a machine learning model. Loading the transform metadata from a json file allows you to transform a new input array (e.g. for predictions) without needing access to the original dataframe.

Parameters:

Transformer (class) – spacekit.preprocessor.transform.Transformer parent class

Returns:

spacekit.preprocessor.transform.PowerX power transform subclass

Return type:

PowerX class object


apply_power_matrix()[source]

Transforms the input data. This method assumes we already have tx_data and a fit-transformed input_matrix (array of continuous feature vectors), which normally is done automatically when the class object is instantiated and calculate_power is called.

Returns:

power transformed continuous feature vectors

Return type:

ndarray

calculate_power()[source]

Fits and transforms the continuous feature vectors using the scikit-learn PowerTransformer. As a separate step, it standardizes each vector to zero mean and unit variance, storing the means and standard deviations along with the lambdas in the tx_data dictionary attribute. This is so that the same normalization can be applied later to prediction inputs without requiring the original training data - otherwise it would be the same as using PowerTransformer(standardize=True). Optionally, the calculated transform data can be stored in a json file on local disk.

Returns:

spacekit.preprocessor.transform.PowerX object with transformation metadata calculated for the input data and stored as attributes.

Return type:

self

fitX()[source]

Instantiates a scikit-learn PowerTransformer object and fits it to the input data. If tx_data was passed as a kwarg or loaded from tx_file, the lambdas attribute of the transformer object will be updated to use these values instead of those calculated at the transform step.

Returns:

transformer fit to the data

Return type:

PowerTransformer object

get_lambdas()[source]

Retrieves the lambdas from a file or dictionary if passed as kwargs; otherwise uses the lambdas calculated in the transformX method. If transformX has not been called yet, returns None.

Returns:

transform of multiple feature vectors returns an array of lambda values; otherwise a single vector returns a single (float) value.

Return type:

ndarray or float

transformX()[source]

Applies the scikit-learn PowerTransformer to the input data.

Returns:

continuous feature vectors transformed via the scikit-learn PowerTransformer

Return type:

ndarray

PowerX Examples

Calculate the normalization parameters of a dataframe (“training set”) using the Yeo-Johnson power transform and save the parameters to a json file on local disk. Use this metadata (``PowerTransformer.lambdas_``, mean, and standard deviation for each continuous feature vector) to transform new inputs (“test set”) in A) the same session or B) a separate session.

Example 1A: Normalize a Dataframe, Apply to Another Dataframe Separately

Px = PowerX(df, cols=["numexp", "rms_ra", "rms_dec", "nmatches", "point", "segment", "gaia"], save_tx=True)
dfX = PowerX(df2, cols=Px.cols, tx_data=Px.tx_data).Xt

Example 1B: Load saved transform data from json file, apply to new data (separate session)

tx_file = "data/tx_data"
Px = PowerX(df2, cols=["numexp", "rms_ra", "rms_dec"], tx_file=tx_file)
dfX = Px.Xt

Example 2: Normalize 2D numpy array (exclude specific axes)

# the last 3 columns are encoded categorical features so we exclude these columns
X = np.asarray([[143.,235.,10.4, 79., 0, 1, 0],[109.,262.,15.9, 63., 1, 0, 1]])
Px = PowerX(X, cols=[0,1,2,3])
Xt = Px.Xt
spacekit.preprocessor.transform.normalize_training_data(df, cols, X_train, X_test, X_val=None, rename=None, output_path=None)[source]

Apply the Yeo-Johnson power transform (via scikit-learn) normalization and scaling to the training data, saving the transform metadata to a json file on local disk and transforming the train, test and val sets separately (to prevent data leakage).

Parameters:
  • df (pandas dataframe) – training dataset

  • cols (list) – column names or array index values of feature vectors to be transformed (i.e. continuous datatype features)

  • X_train (ndarray) – training set feature inputs array

  • X_test (ndarray) – test set feature inputs array

  • X_val (ndarray, optional) – validation set inputs array, by default None

Returns:

normalized and scaled training, test, and validation sets

Return type:

ndarrays
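
Typical usage, as a sketch: df, cols, and the input arrays are assumed to already exist, and the transformed train, test, and validation sets are assumed to be returned in that order.

from spacekit.preprocessor.transform import normalize_training_data

cols = ["numexp", "rms_ra", "rms_dec", "nmatches", "point", "segment", "gaia"]
X_train, X_test, X_val = normalize_training_data(df, cols, X_train, X_test, X_val=X_val)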

spacekit.preprocessor.transform.normalize_training_images(X_tr, X_ts, X_vl=None)[source]

Scale image inputs so that all pixel values are converted to a decimal between 0 and 1 (divide by 255).

Parameters:
  • X_tr (ndarray) – training set images

  • X_ts (ndarray) – test set images

  • X_vl (ndarray, optional) – validation set images, by default None

Returns:

image set arrays

Return type:

ndarrays
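
Usage sketch (assumes the scaled image sets are returned in the same order they were passed in):

from spacekit.preprocessor.transform import normalize_training_images

X_tr, X_ts, X_vl = normalize_training_images(X_tr, X_ts, X_vl=X_vl)
# all pixel values are now floats between 0 and 1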

spacekit.preprocessor.transform.arrays_to_tensors(X_train, y_train, X_test, y_test, reshape_y=False)[source]

Converts multiple numpy arrays into tensorflow tensor datatypes at once (for convenience).

Parameters:
  • X_train (ndarray) – input training features

  • y_train (ndarray) – training target values

  • X_test (ndarray) – input test features

  • y_test (ndarray) – test target values

Returns:

X_train, y_train, X_test, y_test

Return type:

tensorflow.tensors
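
Usage sketch (assumes the tensors are returned in the same order as the input arrays):

from spacekit.preprocessor.transform import arrays_to_tensors

X_train, y_train, X_test, y_test = arrays_to_tensors(X_train, y_train, X_test, y_test)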

spacekit.preprocessor.transform.tensor_to_array(tensor, reshape=False, shape=(-1, 1))[source]

Convert a tensor back into a numpy array. Optionally reshape the array (e.g. for target class data).

Parameters:
  • tensor (tensor) – tensorflow tensor object

  • reshape (bool, optional) – reshapes the array (-1, 1) using numpy, by default False

Returns:

array of same shape as input tensor, unless reshape=True

Return type:

ndarray
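
Usage sketch (y_tensor here is a hypothetical tensor of target class values):

from spacekit.preprocessor.transform import tensor_to_array

y = tensor_to_array(y_tensor, reshape=True)  # reshaped to (-1, 1)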

spacekit.preprocessor.transform.tensors_to_arrays(X_train, y_train, X_test, y_test)[source]

Converts tensors into arrays, which is necessary for certain regression analysis computations. The y_train and y_test args are reshaped using numpy.reshape(-1, 1).

Parameters:
  • X_train (tensor) – training feature inputs

  • y_train (tensor) – training target outputs

  • X_test (tensor) – test feature inputs

  • y_test (tensor) – test target outputs

Returns:

X_train, y_train, X_test, y_test

Return type:

numpy.ndarrays
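
Usage sketch, e.g. before running regression analysis computations on the arrays:

from spacekit.preprocessor.transform import tensors_to_arrays

X_train, y_train, X_test, y_test = tensors_to_arrays(X_train, y_train, X_test, y_test)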

spacekit.preprocessor.transform.hypersonic_pliers(path_to_train, path_to_test, y_col=[0], skip=1, dlm=',', subtract_y=0.0)[source]

Extracts data into 1-dimensional arrays, using separate target classes (y) for training and test data. Assumes the target (y) is the first column of the data. If the target (y) classes in the raw data are 0 and 2, but you’d like them to be binaries (0 and 1), set subtract_y=1.0.

Parameters:
  • path_to_train (string) – path to training data file (csv)

  • path_to_test (string) – path to test data file (csv)

  • y_col (list, optional) – axis index of target class, by default [0]

  • skip (int, optional) – skiprows parameter for np.loadtxt, by default 1

  • dlm (str, optional) – delimiter, by default “,”

  • subtract_y (float, optional) – subtract this value from all y-values, by default 0.0

Returns:

X_train, X_test, y_train, y_test

Return type:

np.ndarrays
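
Example call (the csv file paths are hypothetical; the returned arrays follow the order listed above):

from spacekit.preprocessor.transform import hypersonic_pliers

X_train, X_test, y_train, y_test = hypersonic_pliers(
    "data/train.csv", "data/test.csv", subtract_y=1.0
)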

spacekit.preprocessor.transform.babel_fish_dispenser(matrix1, matrix2=None, step_size=None, axis=2)[source]

Adds an input corresponding to the running average over a set number of time steps. This helps the neural network to ignore high frequency noise by passing in a uniform 1-D filter and stacking the arrays.

Parameters:
  • matrix1 (numpy array) – e.g. X_train

  • matrix2 (numpy array, optional) – e.g. X_test, by default None

  • step_size (int, optional) – timesteps for 1D filter (e.g. 200), by default None

  • axis (int, optional) – which axis to stack the arrays, by default 2

Returns:

original input array(s) with the uniform 1-D filtered (running average) signal stacked along the given axis

Return type:

numpy array(s)
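
Usage sketch (step_size=200 follows the example value given above; assumes the train and test matrices are returned in order):

from spacekit.preprocessor.transform import babel_fish_dispenser

X_train, X_test = babel_fish_dispenser(X_train, X_test, step_size=200)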

spacekit.preprocessor.transform.fast_fourier(matrix, bins)[source]

Takes an array (e.g. signal input values) and rotates it a given number of bins to the left via fast Fourier transform. Returns a vector of length equal to the input array.

Parameters:
  • matrix (ndarray) – input values to transform

  • bins (int) – number of rotations

Returns:

vector of length equal to matrix input array

Return type:

ndarray
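
Usage sketch (signal is a hypothetical 1-D input array; bins=100 is an illustrative value):

from spacekit.preprocessor.transform import fast_fourier

signal_fft = fast_fourier(signal, bins=100)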