tsfresh.feature_selection package

Submodules

tsfresh.feature_selection.relevance module

Contains a feature selection method that evaluates the importance of the different extracted features. To do so, the influence of every feature on the target is evaluated by a univariate test and a p-value is calculated. The methods that calculate the p-values are called feature selectors.

Afterwards the Benjamini-Hochberg procedure, a multiple testing procedure, decides which features to keep and which to cut off (solely based on the p-values).
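The cut-off step can be sketched in plain Python as follows. This is only an illustration of the step-up rule, not the code tsfresh runs: the function name benjamini_hochberg and its independent flag are made up for this example, and the harmonic correction for dependent tests (the Benjamini-Yekutieli variant) is an assumption about what a setting such as hypotheses_independent=False corresponds to.

```python
def benjamini_hochberg(p_values, fdr_level=0.05, independent=True):
    """Return one boolean per p-value: True where the null hypothesis is rejected.

    Hypothetical helper for illustration; not part of the tsfresh API.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor: 1 for independent tests, the harmonic sum otherwise
    # (Benjamini-Yekutieli variant, assumed here for dependent hypotheses).
    c = 1.0 if independent else sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k / (m * c) * fdr_level.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / (m * c) * fdr_level:
            k_max = rank
    # Reject the k_max smallest p-values, i.e. keep those features.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
relevant = benjamini_hochberg(p_values, fdr_level=0.05)
# relevant == [True, True, False, False, False]
```

Note that the rule looks for the largest rank that passes its threshold, so a p-value can be accepted even if a smaller rank failed its own threshold.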

tsfresh.feature_selection.relevance.calculate_relevance_table(X, y, ml_task='auto', multiclass=False, n_significant=1, n_jobs=1, show_warnings=False, chunksize=None, test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False)[source]

Calculate the relevance table for the features contained in feature matrix X with respect to target vector y. The relevance table is calculated for the intended machine learning task ml_task.

To accomplish this, a univariate feature significance test is conducted for each feature from the input pandas.DataFrame. Those tests generate p-values that are then evaluated by the Benjamini-Hochberg procedure to decide which features to keep and which to delete.

We are testing

H_0 = the Feature is not relevant and should not be added

against

H_1 = the Feature is relevant and should be kept

or in other words

H_0 = Target and Feature are independent / the Feature has no influence on the target

H_1 = Target and Feature are associated / dependent

When the target is binary this becomes

H_0 = \left( F_{\text{target}=1} = F_{\text{target}=0} \right)

H_1 = \left( F_{\text{target}=1} \neq F_{\text{target}=0} \right)

where F is the distribution of the feature.

In the same way we can state the hypothesis when the feature is binary

H_0 =  \left( T_{\text{feature}=1} = T_{\text{feature}=0} \right)

H_1 = \left( T_{\text{feature}=1} \neq T_{\text{feature}=0} \right)

Here T is the distribution of the target.
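To make the binary-target hypothesis more concrete, the sketch below splits a feature by target value and measures how far apart the two empirical distributions are. The helper ks_statistic is hypothetical and only computes the Kolmogorov-Smirnov distance between two samples, not a p-value; the data are invented for the example.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs.

    Hypothetical helper for illustration; it returns a distance, not a p-value.
    """
    def ecdf(sample, t):
        # Fraction of observations that are less than or equal to t.
        return sum(value <= t for value in sample) / len(sample)

    # The difference between two step functions can only change at observed values.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, t) - ecdf(sample_b, t)) for t in points)


feature = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
target = [0, 0, 0, 1, 1, 1]

# Split the feature values by the value of the binary target.
feature_given_0 = [f for f, t in zip(feature, target) if t == 0]
feature_given_1 = [f for f, t in zip(feature, target) if t == 1]

distance = ks_statistic(feature_given_0, feature_given_1)
# distance == 1.0: the two groups do not overlap at all
```

A distance close to 0 is what one would expect under H_0, while a large distance speaks for H_1. A significance test turns such a statistic into a p-value.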

TODO: And for real valued?

Parameters:
  • X (pandas.DataFrame) – Feature matrix which will be reduced to only the relevant features. It can contain both binary and real-valued features at the same time.

  • y (pandas.Series or numpy.ndarray) – Target vector which is needed to test which features are relevant. Can be binary or real-valued.

  • ml_task (str) – The intended machine learning task. Either ‘classification’, ‘regression’ or ‘auto’. Defaults to ‘auto’, meaning the intended task is inferred from y. If y has a boolean, integer or object dtype, the task is assumed to be classification, else regression.

  • multiclass (bool) – Whether the problem is multiclass classification. This modifies the way in which features are selected. Multiclass requires the features to be statistically significant for predicting n_significant classes.

  • n_significant (int) – The number of classes for which features should be statistically significant predictors to be regarded as ‘relevant’

  • test_for_binary_target_binary_feature (str) – Which test to be used for binary target, binary feature (currently unused)

  • test_for_binary_target_real_feature (str) – Which test to be used for binary target, real feature

  • test_for_real_target_binary_feature (str) – Which test to be used for real target, binary feature (currently unused)

  • test_for_real_target_real_feature (str) – Which test to be used for real target, real feature (currently unused)

  • fdr_level (float) – The FDR level that should be respected. This is the theoretical expected percentage of irrelevant features among all features selected as relevant.

  • hypotheses_independent (bool) – Can the significance of the features be assumed to be independent? Normally, this should be set to False as the features are never independent (e.g. mean and median)

  • n_jobs (int) – Number of processes to use during the p-value calculation

  • show_warnings (bool) – Show warnings during the p-value calculation (needed for debugging of calculators).

  • chunksize (None or int) – The size of one chunk that is submitted to the worker process for the parallelisation, where one chunk is defined as the data for one feature. If you set the chunksize to 10, one task is to filter 10 features. If it is set to None, heuristics that depend on the distributor are used to find the optimal chunksize. If you get out-of-memory exceptions, you can try the dask distributor with a smaller chunksize.

Returns:

A pandas.DataFrame with each column of the input DataFrame X as index, with information on the significance of this particular feature. The DataFrame has the columns “feature”, “type” (binary, real or const), “p_value” (the significance of this feature as a p-value, lower means more significant) and “relevant” (True if the Benjamini-Hochberg procedure rejected the null hypothesis [the feature is not relevant] for this feature). If the problem is multiclass with n classes, the DataFrame will contain n columns named “p_value_CLASSID” instead of the “p_value” column. CLASSID refers here to the different values set in y. There will also be n columns named “relevant_CLASSID”, indicating whether the feature is relevant for that class.

Return type:

pandas.DataFrame
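For the multiclass case, the way the per-class decisions could be folded into one overall decision can be sketched as below. This is an assumption based on the parameter descriptions above: a feature counts as relevant when it is significant for at least n_significant classes. The function name and the example flags are hypothetical.

```python
def is_relevant_multiclass(relevant_per_class, n_significant=1):
    """Fold per-class relevance flags into one overall decision.

    relevant_per_class maps a column name such as "relevant_CLASSID" to a bool.
    Hypothetical helper for illustration; not part of the tsfresh API.
    """
    # Count the classes for which the feature was found significant.
    n_classes_significant = sum(bool(flag) for flag in relevant_per_class.values())
    # Assumption: "at least n_significant classes" is the threshold.
    return n_classes_significant >= n_significant


flags = {"relevant_a": True, "relevant_b": False, "relevant_c": True}
relevant_for_two = is_relevant_multiclass(flags, n_significant=2)    # True
relevant_for_three = is_relevant_multiclass(flags, n_significant=3)  # False
```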

tsfresh.feature_selection.relevance.combine_relevance_tables(relevance_tables)[source]

Create a combined relevance table out of a list of relevance tables, aggregating the p-values and the relevances.

Parameters:

relevance_tables (List[pd.DataFrame]) – A list of relevance tables

Returns:

The combined relevance table

Return type:

pandas.DataFrame
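The description does not say how the values are aggregated. One plausible reading, sketched below with plain dictionaries instead of DataFrames, is to keep the smallest p-value per feature and to mark a feature as relevant if any of the tables does. Both the aggregation rule and the name combine_tables_sketch are assumptions made for this illustration.

```python
from functools import reduce


def combine_tables_sketch(tables):
    """Combine relevance tables given as {feature: {"p_value": ..., "relevant": ...}}.

    Hypothetical helper; assumes every table contains the same features.
    """
    def _combine(a, b):
        return {
            name: {
                # Assumption: the most significant (smallest) p-value wins.
                "p_value": min(a[name]["p_value"], b[name]["p_value"]),
                # Assumption: relevant in any table means relevant overall.
                "relevant": a[name]["relevant"] or b[name]["relevant"],
            }
            for name in a
        }

    return reduce(_combine, tables)


table_1 = {"f1": {"p_value": 0.20, "relevant": False},
           "f2": {"p_value": 0.01, "relevant": True}}
table_2 = {"f1": {"p_value": 0.03, "relevant": True},
           "f2": {"p_value": 0.40, "relevant": False}}

combined = combine_tables_sketch([table_1, table_2])
# combined["f1"] == {"p_value": 0.03, "relevant": True}
# combined["f2"] == {"p_value": 0.01, "relevant": True}
```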

tsfresh.feature_selection.relevance.get_feature_type(feature_column)[source]

For a given feature, determine if it is real, binary or constant. Here binary means that only two unique values occur in the feature.

Parameters:

feature_column (pandas.Series) – The feature column

Returns:

‘constant’, ‘binary’ or ‘real’
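The rule stated above only depends on the number of distinct values, which the following sketch mirrors on a plain Python list instead of a pandas.Series. The name feature_type_sketch is hypothetical, and missing values are not treated specially here.

```python
def feature_type_sketch(values):
    """Classify a feature by its number of distinct values.

    Hypothetical stand-in for illustration; works on any iterable of hashable values.
    """
    n_unique = len(set(values))
    if n_unique == 1:
        return "constant"
    if n_unique == 2:
        # Any two distinct values count as binary, not only 0 and 1.
        return "binary"
    return "real"


feature_type_sketch([3.0, 3.0, 3.0])      # 'constant'
feature_type_sketch([-1, 5, 5, -1])       # 'binary'
feature_type_sketch([0.1, 0.2, 0.3])      # 'real'
```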

tsfresh.feature_selection.relevance.infer_ml_task(y)[source]

Infer the machine learning task to select for. The result will be either ‘regression’ or ‘classification’. If the target vector consists only of integer-typed values or objects, we assume the task is ‘classification’. Otherwise we assume ‘regression’.

Parameters:

y (pandas.Series) – The target vector y.

Returns:

‘classification’ or ‘regression’

Return type:

str

tsfresh.feature_selection.selection module

This module contains the filtering process for the extracted features. The filtering procedure can also be used on other features that are not based on time series.

tsfresh.feature_selection.selection.select_features(X, y, test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False, n_jobs=1, show_warnings=False, chunksize=None, ml_task='auto', multiclass=False, n_significant=1)[source]

Check the significance of all features (columns) of feature matrix X and return a possibly reduced feature matrix only containing relevant features.

The feature matrix must be a pandas.DataFrame in the format:

    index    feature_1    feature_2    ...    feature_N
    A        ...          ...          ...    ...
    B        ...          ...          ...    ...

Each column will be handled as a feature and tested for its significance to the target.

The target vector must be a pandas.Series or numpy.array in the form

    index    target
    A        ...
    B        ...
    .        .
    .        .

and must contain all ids that are in the feature matrix. If y is a numpy.array without index, it is assumed that y has the same order and length as X and that the rows correspond to each other.

Examples

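>>> from tsfresh.examples.robot_execution_failures import (
...     download_robot_execution_failures, load_robot_execution_failures)
>>> from tsfresh import extract_features, select_features
>>> from tsfresh.utilities.dataframe_functions import impute
>>> download_robot_execution_failures()  # downloads the data set on first use
>>> df, y = load_robot_execution_failures()
>>> X_extracted = extract_features(df, column_id='id', column_sort='time')
>>> X_imputed = impute(X_extracted)  # replace NaN and infinite values before selecting
>>> X_selected = select_features(X_imputed, y)

Parameters: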
  • X (pandas.DataFrame) – Feature matrix in the format mentioned before which will be reduced to only the relevant features. It can contain both binary and real-valued features at the same time.

  • y (pandas.Series or numpy.ndarray) – Target vector which is needed to test which features are relevant. Can be binary or real-valued.

  • test_for_binary_target_binary_feature (str) – Which test to be used for binary target, binary feature (currently unused)

  • test_for_binary_target_real_feature (str) – Which test to be used for binary target, real feature

  • test_for_real_target_binary_feature (str) – Which test to be used for real target, binary feature (currently unused)

  • test_for_real_target_real_feature (str) – Which test to be used for real target, real feature (currently unused)

  • fdr_level (float) – The FDR level that should be respected. This is the theoretical expected percentage of irrelevant features among all features selected as relevant.

  • hypotheses_independent (bool) – Can the significance of the features be assumed to be independent? Normally, this should be set to False as the features are never independent (e.g. mean and median)

  • n_jobs (int) – Number of processes to use during the p-value calculation

  • show_warnings (bool) – Show warnings during the p-value calculation (needed for debugging of calculators).

  • chunksize (None or int) – The size of one chunk that is submitted to the worker process for the parallelisation, where one chunk is defined as the data for one feature. If you set the chunksize to 10, one task is to filter 10 features. If it is set to None, heuristics that depend on the distributor are used to find the optimal chunksize. If you get out-of-memory exceptions, you can try the dask distributor with a smaller chunksize.

  • ml_task (str) – The intended machine learning task. Either ‘classification’, ‘regression’ or ‘auto’. Defaults to ‘auto’, meaning the intended task is inferred from y. If y has a boolean, integer or object dtype, the task is assumed to be classification, else regression.

  • multiclass (bool) – Whether the problem is multiclass classification. This modifies the way in which features are selected. Multiclass requires the features to be statistically significant for predicting n_significant classes.

  • n_significant (int) – The number of classes for which features should be statistically significant predictors to be regarded as ‘relevant’. Only specify when multiclass=True

Returns:

The same DataFrame as X, but possibly with reduced number of columns ( = features).

Return type:

pandas.DataFrame

Raises:

ValueError when the target vector does not fit to the feature matrix or ml_task is not one of ‘auto’, ‘classification’ or ‘regression’.

tsfresh.feature_selection.significance_tests module

Contains the methods from the following paper about the FRESH algorithm [2]

FRESH is based on hypothesis tests that individually check the significance of every generated feature with respect to the target. It makes sure that only features relevant for the regression or classification task at hand are kept. FRESH distinguishes between four settings, depending on whether the features and the target are binary or not.

The four functions are named

  1. target_binary_feature_binary_test(): Target and feature are both binary

  2. target_binary_feature_real_test(): Target is binary and feature real

  3. target_real_feature_binary_test(): Target is real and the feature is binary

  4. target_real_feature_real_test(): Target and feature are both real
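How one of the four settings could be picked for a given feature and target is sketched below. The function choose_test_name is hypothetical and only returns the name of the matching test function. It assumes that "binary" means exactly two distinct values, as described for get_feature_type(), and it ignores constant columns.

```python
def choose_test_name(feature_values, target_values):
    """Return the name of the significance test matching the four settings.

    Hypothetical helper for illustration; not part of the tsfresh API.
    """
    # Assumption: binary means exactly two distinct values occur.
    feature_is_binary = len(set(feature_values)) == 2
    target_is_binary = len(set(target_values)) == 2

    if target_is_binary:
        if feature_is_binary:
            return "target_binary_feature_binary_test"
        return "target_binary_feature_real_test"
    if feature_is_binary:
        return "target_real_feature_binary_test"
    return "target_real_feature_real_test"


choose_test_name([0, 1, 1, 0], [1, 1, 0, 0])            # 'target_binary_feature_binary_test'
choose_test_name([0.3, 1.7, 2.2, 0.9], [1, 1, 0, 0])    # 'target_binary_feature_real_test'
choose_test_name([0, 1, 1, 0], [2.5, 3.1, 0.4, 1.8])    # 'target_real_feature_binary_test'
choose_test_name([0.3, 1.7, 2.2], [2.5, 3.1, 0.4])      # 'target_real_feature_real_test'
```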

References

tsfresh.feature_selection.significance_tests.target_binary_feature_binary_test(x, y)[source]

Calculate the feature significance of a binary feature to a binary target as a p-value. Use the two-sided univariate Fisher's exact test from fisher_exact() for this.

Parameters:
  • x (pandas.Series) – the binary feature vector

  • y (pandas.Series) – the binary target vector

Returns:

the p-value of the feature significance test. Lower p-values indicate a higher feature significance

Return type:

float

Raises:

ValueError if the target or the feature is not binary.
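What a two-sided Fisher's exact test computes can be illustrated from scratch with the standard library. The sketch below is not the fisher_exact() implementation. It builds the 2x2 contingency table of feature against target and sums the probabilities of all tables with the same margins that are at most as likely as the observed one. The function name and the sample data are invented for this example.

```python
from math import comb


def fisher_exact_two_sided(x, y):
    """Two-sided Fisher's exact test p-value for two binary sequences.

    Hypothetical from-scratch sketch; assumes x and y have equal length.
    """
    x_values = sorted(set(x))
    y_values = sorted(set(y))
    if len(x_values) != 2 or len(y_values) != 2:
        raise ValueError("Both the feature and the target must be binary.")

    pairs = list(zip(x, y))
    # Cells of the 2x2 contingency table (rows: feature value, columns: target value).
    a = sum(1 for xi, yi in pairs if xi == x_values[0] and yi == y_values[0])
    b = sum(1 for xi, yi in pairs if xi == x_values[0] and yi == y_values[1])
    c = sum(1 for xi, yi in pairs if xi == x_values[1] and yi == y_values[0])
    d = sum(1 for xi, yi in pairs if xi == x_values[1] and yi == y_values[1])
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric probability of seeing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables with the same margins that are at most as likely.
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)


feature = [1, 1, 1, 1, 0, 0, 0, 0]
target = [1, 1, 1, 0, 1, 0, 0, 0]
p_value = fisher_exact_two_sided(feature, target)  # 34 / 70, roughly 0.486
```

Working with integer weights keeps the comparison with the observed table exact, which avoids floating-point ties when deciding which tables count as "at least as extreme".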

tsfresh.feature_selection.significance_tests.target_binary_feature_real_test(x, y, test)[source]

Calculate the feature significance of a real-valued feature to a binary target as a p-value. Use either the Mann-Whitney U test or the Kolmogorov-Smirnov test, from mannwhitneyu() or ks_2samp() respectively, for this.

Parameters:
  • x (pandas.Series) – the real-valued feature vector

  • y (pandas.Series) – the binary target vector

  • test (str) – The significance test to be used. Either 'mann' for the Mann-Whitney-U test or 'smir' for the Kolmogorov-Smirnov test

Returns:

the p-value of the feature significance test. Lower p-values indicate a higher feature significance

Return type:

float

Raises:

ValueError if the target is not binary.

tsfresh.feature_selection.significance_tests.target_real_feature_binary_test(x, y)[source]

Calculate the feature significance of a binary feature to a real-valued target as a p-value. Use the Kolmogorov-Smirnov test from ks_2samp() for this.

Parameters:
  • x (pandas.Series) – the binary feature vector

  • y (pandas.Series) – the real-valued target vector

Returns:

the p-value of the feature significance test. Lower p-values indicate a higher feature significance.

Return type:

float

Raises:

ValueError if the feature is not binary.

tsfresh.feature_selection.significance_tests.target_real_feature_real_test(x, y)[source]

Calculate the feature significance of a real-valued feature to a real-valued target as a p-value. Use Kendall’s tau from kendalltau() for this.

Parameters:
  • x (pandas.Series) – the real-valued feature vector

  • y (pandas.Series) – the real-valued target vector

Returns:

the p-value of the feature significance test. Lower p-values indicate a higher feature significance.

Return type:

float

Module contents

The feature_selection module contains feature selection algorithms. These methods are suited to picking the best explaining features out of a massive number of features. Often the features have to be picked in situations where one has more features than samples. Traditional feature selection methods may not be suitable for such situations, which is why we propose a p-value based approach that inspects the significance of the features individually to avoid overfitting and spurious correlations.