tsfresh¶
This is the documentation of tsfresh.
tsfresh is a Python package that automatically calculates a large number of time series characteristics, the so-called features. The package also contains methods to evaluate the explanatory power and importance of such characteristics for regression or classification tasks.
Contents¶
The following chapters will explain the tsfresh package in detail:
Introduction¶
Why do you need such a module?¶
tsfresh is used to extract characteristics from time series. Let’s assume you recorded the ambient temperature around your computer over one day as the following time series:

Now you want to calculate different characteristics such as the maximal or minimal temperature, the average temperature or the number of temporary temperature peaks:

Without tsfresh, you would have to calculate all those characteristics by hand. With tsfresh, this process is automated.
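As a rough illustration of what "by hand" means here, the following plain-Python sketch computes a few such characteristics for a short series. The temperature values are made up, and the peak definition (a value strictly larger than both direct neighbours) is an assumption for illustration, not tsfresh's implementation.

```python
# Hypothetical temperature readings in degrees Celsius (made-up values).
temperatures = [21.0, 21.5, 23.0, 22.0, 22.5, 24.0, 23.5, 22.0]

maximum = max(temperatures)
minimum = min(temperatures)
average = sum(temperatures) / len(temperatures)

# Count temporary peaks: values strictly larger than both direct neighbours.
number_of_peaks = sum(
    1
    for left, mid, right in zip(temperatures, temperatures[1:], temperatures[2:])
    if mid > left and mid > right
)

print(maximum, minimum, average, number_of_peaks)
```

Every additional characteristic needs another hand-written snippet like this, which is the effort the package is meant to remove.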
Further, tsfresh is compatible with Python's pandas and scikit-learn APIs, two important packages for data science endeavours in Python.
What to do with these features?¶
The extracted features can be used to describe or cluster time series based on the extracted characteristics. Further, they can be used to build models that perform classification/regression tasks on the time series. Often the features give new insights into time series and their dynamics.
The tsfresh package has been used successfully in projects involving
- the prediction of the life span of machines
- the prediction of the quality of steel billets during a continuous casting process
What not to do with tsfresh?¶
Currently, tsfresh is not suitable
- for usage with streaming data
- for batch processing over a distributed architecture, where different time series are fragmented over different computational units
- to train models on the features (we do not want to reinvent the wheel, check out the python package scikit-learn for example)
However, some of these use cases could be implemented. If you have an application in mind, open an issue at https://github.com/blue-yonder/tsfresh/issues or feel free to contact us.
Quick Start¶
Install tsfresh¶
As the compiled tsfresh package is hosted on PyPI, you can easily install it with pip:
pip install tsfresh
Dive in¶
Before boring yourself by reading the docs in detail, you can dive right into tsfresh with the following example:
We are given a data set containing robot failures as discussed in [1]. Each robot records time series from six different sensors. For each sample denoted by a different id we are going to classify if the robot reports a failure or not. From a machine learning point of view, our goal is to classify each group of time series.
To start, we load the data into python
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
load_robot_execution_failures
download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()
and end up with a pandas.DataFrame timeseries having the following shape
(index) | id | time | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | -1 | -1 | 63 | -3 | -1 | 0 |
1 | 1 | 1 | 0 | 0 | 62 | -3 | -1 | 0 |
2 | 1 | 2 | -1 | -1 | 61 | -3 | 0 | 0 |
3 | 1 | 3 | -1 | -1 | 63 | -2 | -1 | 0 |
4 | 1 | 4 | -1 | -1 | 63 | -3 | -1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
The first column is the DataFrame index and has no meaning here. There are six different time series (a-f) for the different sensors. The different robots are denoted by the id column.
On the other hand, y contains the information on which robot id reported a failure and which did not:
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
... | ... |
Here, for the samples with ids 1 to 5 no failure was reported.
In the following we illustrate the time series of the sample id 3 reporting no failure:

And for id 20 reporting a failure:

You can already see some differences by eye, but for successful machine learning we have to put these differences into numbers.
This is where tsfresh comes into play. It allows us to automatically extract over 1200 features from those six different time series for each robot.
For extracting all features, we do:
from tsfresh import extract_features
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
You end up with a DataFrame extracted_features containing all of the more than 1200 extracted features.
We will now remove all NaN values and select only the relevant features:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
Only around 300 features were classified as relevant enough.
Further, you can even perform the extraction, imputing and filtering at the same time with the
tsfresh.extract_relevant_features()
function:
from tsfresh import extract_relevant_features
features_filtered_direct = extract_relevant_features(timeseries, y, column_id='id', column_sort='time')
You can now use the features contained in the DataFrame features_filtered (which is equal to features_filtered_direct) in conjunction with y to train your model. Please see the robot_failure_example.ipynb Jupyter Notebook in the folder named notebook. In this notebook a RandomForestClassifier is trained on the extracted features.
References
[1] | http://archive.ics.uci.edu/ml/datasets/Robot+Execution+Failures |
tsfresh¶
tsfresh package¶
Subpackages¶
tsfresh.convenience package¶
-
tsfresh.convenience.relevant_extraction.
extract_relevant_features
(timeseries_container, y, X=None, feature_extraction_settings=None, feature_selection_settings=None, column_id=None, column_sort=None, column_kind=None, column_value=None)[source]¶ High level convenience function to extract time series features from timeseries_container. Then return feature matrix X possibly augmented with features relevant with respect to target vector y.
For more details see the documentation of extract_features() and select_features().
Examples
>>> from tsfresh.examples import load_robot_execution_failures
>>> from tsfresh import extract_relevant_features
>>> df, y = load_robot_execution_failures()
>>> X = extract_relevant_features(df, y, column_id='id', column_sort='time')
Parameters: - timeseries_container – See parameter timeseries_container in
extract_features()
- y – See parameter y in
select_features()
- X – See parameter X in
select_features()
- column_id – See parameter column_id in
extract_features()
- column_sort – See parameter column_sort in
extract_features()
- column_kind – See parameter column_kind in
extract_features()
- column_value – See parameter column_value in
extract_features()
- feature_extraction_settings – See parameter feature_extraction_settings in
extract_features()
- feature_selection_settings – See parameter feature_selection_settings in
select_features()
Returns: Feature matrix X, possibly extended with relevant time series features.
The convenience submodule contains methods that allow the user to extract and filter features conveniently.
tsfresh.examples package¶
This module implements functions to download and load the Human Activity Recognition dataset [4]. A description of the data set can be found in [5].
References
[4] | http://mlr.cs.umass.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones |
[5] | Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. (2013) A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013. |
This module implements functions to download the Robot Execution Failures LP1 Data Set[1] and load it as a DataFrame.
Important: You need to download the data set yourself, either manually or via the function
download_robot_execution_failures()
References
[1] | http://mlr.cs.umass.edu/ml/datasets/Robot+Execution+Failures |
[2] | Lichman, M. (2013). UCI Machine Learning Repository [http://mlr.cs.umass.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. |
[3] | Camarinha-Matos, L.M., L. Seabra Lopes, and J. Barata (1996). Integration and Learning in Supervision of Flexible Assembly Systems. “IEEE Transactions on Robotics and Automation”, 12 (2), 202-219 |
-
tsfresh.examples.robot_execution_failures.
download_robot_execution_failures
()[source]¶ Download the Robot Execution Failures LP1 Data Set[1] from the UCI Machine Learning Repository[2] and store it locally.
Examples
>>> from tsfresh.examples import download_robot_execution_failures
>>> download_robot_execution_failures()
-
tsfresh.examples.robot_execution_failures.
load_robot_execution_failures
()[source]¶ Load the Robot Execution Failures LP1 Data Set[1]. The time series are passed as a flat DataFrame.
Examples
>>> from tsfresh.examples import load_robot_execution_failures
>>> df, y = load_robot_execution_failures()
>>> print(df.shape)
(1320, 8)
Returns: time series data as pandas.DataFrame and target vector as pandas.Series
Return type: tuple
Module with exemplary data sets to play around with.
See for example the Quick Start section on how to use them.
tsfresh.feature_extraction package¶
This module contains the main function to interact with tsfresh: extract features
-
tsfresh.feature_extraction.extraction.
extract_features
(timeseries_container, feature_extraction_settings=None, column_id=None, column_sort=None, column_kind=None, column_value=None)[source]¶ Extract features from
- a
pandas.DataFrame
containing the different time series
or
- a dictionary of
pandas.DataFrame
each containing one type of time series
In both cases a pandas.DataFrame with the calculated features will be returned.
For a list of all the calculated time series features, please see the FeatureExtractionSettings class, which is used to control which features with which parameters are calculated.
For a detailed explanation of the different parameters and data formats please see Data Formats.
Examples
>>> from tsfresh.examples import load_robot_execution_failures
>>> from tsfresh import extract_features
>>> df, _ = load_robot_execution_failures()
>>> X = extract_features(df, column_id='id', column_sort='time')
which would give the same results as described above. In this case, the column_kind is not allowed. Except that, the same rules for leaving out the columns apply as above.
Parameters: - timeseries_container (pandas.DataFrame or dict) – The pandas.DataFrame with the time series to compute the features for, or a dictionary of pandas.DataFrames.
- column_id (str) – The name of the id column to group by.
- column_sort (str) – The name of the sort column.
- column_kind (str) – The name of the column keeping record on the kind of the value.
- column_value (str) – The name for the column keeping the value itself.
- feature_extraction_settings (tsfresh.feature_extraction.settings.FeatureExtractionSettings) – settings object that controls which features are calculated
Returns: The (maybe imputed) DataFrame with the extracted features.
Return type: pandas.DataFrame
This module contains the feature calculators that take time series as input and calculate the values of the feature. There are three types of features:
1. aggregate features without parameter
2. aggregate features with parameter
3. apply features with parameters
Types 1 and 2 are designed to be used with pandas aggregate and each return a single feature. Type 3 calculates a group of features at the same time, so that auxiliary calculations are not redone unnecessarily. Type 3 calculators can be used with pandas apply.
-
tsfresh.feature_extraction.feature_calculators.
abs_energy
(x, *arg, **args)[source]¶ Returns the absolute energy of the time series which is the sum over the squared values
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
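A minimal plain-Python sketch of the absolute energy as described above (the sum over the squared values). The function name `abs_energy_sketch` is hypothetical and works on a plain list rather than a pandas.Series; it is not the library's code.

```python
def abs_energy_sketch(x):
    """Return the sum of the squared values of the sequence x."""
    return sum(value * value for value in x)


energy = abs_energy_sketch([1, -2, 3])  # 1 + 4 + 9
print(energy)
```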
-
tsfresh.feature_extraction.feature_calculators.
absolute_sum_of_changes
(x, *arg, **args)[source]¶ Returns the sum over the absolute value of consecutive changes in the series x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
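The sum over the absolute consecutive changes can be sketched in plain Python as follows. The helper name `absolute_sum_of_changes_sketch` is hypothetical, and a plain list stands in for the pandas.Series.

```python
def absolute_sum_of_changes_sketch(x):
    """Sum of |x[i+1] - x[i]| over all consecutive pairs of x."""
    return sum(abs(after - before) for before, after in zip(x, x[1:]))


changes = absolute_sum_of_changes_sketch([1, 4, 2, 2])  # |3| + |-2| + |0|
print(changes)
```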
-
tsfresh.feature_extraction.feature_calculators.
approximate_entropy
(x, m, r)[source]¶ Implements a vectorized Approximate entropy algorithm.
For short time-series this method is highly dependent on the parameters, but should be stable for N > 2000, see:
Yentes et al. (2012) - The Appropriate Use of Approximate Entropy and Sample Entropy with Short Data Sets
Other shortcomings and alternatives discussed in:
Richman & Moorman (2000) - Physiological time-series analysis using approximate entropy and sample entropy
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- m (int) – Length of compared run of data
- r (float) – Filtering level, must be positive
Returns: Approximate entropy
Return type: float
This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
ar_coefficient
(x, *arg, **args)[source]¶ This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process. The k parameter is the maximum lag of the process.
For each configuration in param, which should contain the maximum lag “k”, such an AR process is fitted. Then the coefficients $\varphi_i$ whose index $i$ is given by “coeff” are returned.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- c (str) – the time series name
- param (list) – contains dictionaries {“coeff”: x, “k”: y} with x,y int
Returns: the different feature values
Return type: pandas.Series
This function is of type: apply
-
tsfresh.feature_extraction.feature_calculators.
augmented_dickey_fuller
(x, *arg, **args)[source]¶ The Augmented Dickey-Fuller is a hypothesis test which checks whether a unit root is present in a time series sample. This feature calculator returns the value of the respective test statistic.
See the statsmodels implementation for references and more details.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
autocorrelation
(x, *arg, **args)[source]¶ Calculates the autocorrelation of x at the specified lag.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- lag (int) – the lag
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
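A sketch of a lagged autocorrelation in plain Python. It assumes the common estimator that divides the lagged covariance (averaged over the n - lag overlapping pairs) by the population variance; whether the library normalizes in exactly this way is not stated here, and the name `autocorrelation_sketch` is hypothetical.

```python
def autocorrelation_sketch(x, lag):
    """Estimate the autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((value - mean) ** 2 for value in x) / n
    # Average product of deviations over the n - lag overlapping pairs.
    covariance = sum(
        (x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)
    ) / (n - lag)
    return covariance / variance


r_trend = autocorrelation_sketch([1, 2, 3, 4, 5], lag=1)
r_alternating = autocorrelation_sketch([1, -1, 1, -1], lag=1)
print(r_trend, r_alternating)
```

The sketch requires a non-constant series (non-zero variance) and a lag smaller than the series length.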
-
tsfresh.feature_extraction.feature_calculators.
binned_entropy
(x, *arg, **args)[source]¶ First bins the values of x into max_bins equidistant bins. Then calculates the value of
$-\sum_{k} p_k \log(p_k)$
where $p_k$ is the percentage of samples in bin $k$ and the sum runs over the non-empty bins.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- max_bins (int) – the maximal number of bins
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
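The binning and entropy steps can be sketched in plain Python. This is an illustration under stated assumptions: the bins span the interval from min(x) to max(x), the maximum is placed in the last bin, the natural logarithm is used, and empty bins are skipped. The name `binned_entropy_sketch` is hypothetical.

```python
import math


def binned_entropy_sketch(x, max_bins):
    """Entropy of the histogram of x over max_bins equidistant bins."""
    lowest, highest = min(x), max(x)
    if lowest == highest:
        return 0.0  # all values fall into one bin, so there is no uncertainty
    width = (highest - lowest) / max_bins
    counts = [0] * max_bins
    for value in x:
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((value - lowest) / width), max_bins - 1)
        counts[index] += 1
    probabilities = [count / len(x) for count in counts if count > 0]
    return -sum(p * math.log(p) for p in probabilities)


entropy = binned_entropy_sketch([0, 0, 10, 10], max_bins=2)
print(entropy)
```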
-
tsfresh.feature_extraction.feature_calculators.
count_above_mean
(x, *arg, **args)[source]¶ Returns the number of values in x that are higher than the mean of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
count_below_mean
(x, *arg, **args)[source]¶ Returns the number of values in x that are lower than the mean of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
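Both counting features can be sketched in a few lines of plain Python. The function names are hypothetical and the comparison is strict, so values equal to the mean are counted in neither feature.

```python
def count_above_mean_sketch(x):
    """Number of values in x strictly higher than the mean of x."""
    mean = sum(x) / len(x)
    return sum(1 for value in x if value > mean)


def count_below_mean_sketch(x):
    """Number of values in x strictly lower than the mean of x."""
    mean = sum(x) / len(x)
    return sum(1 for value in x if value < mean)


series = [1, 2, 3, 10]  # mean is 4
above = count_above_mean_sketch(series)
below = count_below_mean_sketch(series)
print(above, below)
```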
-
tsfresh.feature_extraction.feature_calculators.
cwt_coefficients
(x, *arg, **args)[source]¶ Calculates a continuous wavelet transform for the Ricker wavelet, also known as the “Mexican hat wavelet”, which is defined by
$\frac{2}{\sqrt{3a}\,\pi^{1/4}} \left(1 - \frac{x^2}{a^2}\right) \exp\left(-\frac{x^2}{2a^2}\right)$
where $a$ is the width parameter of the wavelet function.
This feature calculator takes three different parameters: widths, coeff and w. The feature calculator takes all the different widths arrays and then calculates the cwt one time for each different width array. Then the values for the different coefficients for coeff and width w are returned. (For each dict in param one feature is returned.)
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- c (str) – the time series name
- param (list) – contains dictionaries {“widths”:x, “coeff”: y, “w”: z} with x array of int and y,z int
Returns: the different feature values
Return type: pandas.Series
This function is of type: apply
-
tsfresh.feature_extraction.feature_calculators.
fft_coefficient
(x, *arg, **args)[source]¶ Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input using the fast Fourier transform algorithm
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- c (str) – the time series name
- param (list) – contains dictionaries {“coeff”: x} with x int and x >= 0
Returns: the different feature values
Return type: pandas.Series
This function is of type: apply
-
tsfresh.feature_extraction.feature_calculators.
first_location_of_maximum
(x, *arg, **args)[source]¶ Returns the first location of the maximum value of x. The position is calculated relatively to the length of x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
first_location_of_minimum
(x, *arg, **args)[source]¶ Returns the first location of the minimal value of x. The position is calculated relatively to the length of x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
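A plain-Python sketch of the two "first location" features. It assumes that "relative to the length of x" means the zero-based index of the first extreme value divided by len(x); the function names are hypothetical.

```python
def first_location_of_maximum_sketch(x):
    """Zero-based index of the first maximum of x, divided by len(x)."""
    return x.index(max(x)) / len(x)


def first_location_of_minimum_sketch(x):
    """Zero-based index of the first minimum of x, divided by len(x)."""
    return x.index(min(x)) / len(x)


series = [1, 5, 2, 5, 0]
first_max = first_location_of_maximum_sketch(series)  # index 1 of 5 values
first_min = first_location_of_minimum_sketch(series)  # index 4 of 5 values
print(first_max, first_min)
```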
-
tsfresh.feature_extraction.feature_calculators.
has_duplicate
(x, *arg, **args)[source]¶ Checks if any value in x occurs more than once
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: bool
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
has_duplicate_max
(x, *arg, **args)[source]¶ Checks if the maximum value of x is observed more than once
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: bool
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
has_duplicate_min
(x, *arg, **args)[source]¶ Checks if the minimal value of x is observed more than once
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: bool
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
index_mass_quantile
(x, *arg, **args)[source]¶ These apply features calculate the relative index i where q% of the mass of the time series x lies left of i. For example, for q = 50% this feature calculator will return the mass center of the time series.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- c (str) – the time series name
- param (list) – contains dictionaries {“q”: x} with x float
Returns: the different feature values
Return type: pandas.Series
This function is of type: apply
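One way to read the mass quantile is sketched below in plain Python. The sketch makes several assumptions that the text above does not confirm: the mass of a point is its absolute value, q is given as a fraction between 0 and 1, and the returned relative index is the one-based position of the first point at which the cumulative mass reaches q, divided by len(x). The name `index_mass_quantile_sketch` is hypothetical and the total mass must be non-zero.

```python
def index_mass_quantile_sketch(x, q):
    """Relative position at which a fraction q of the total mass is reached."""
    masses = [abs(value) for value in x]
    total = sum(masses)
    cumulative = 0.0
    for i, mass in enumerate(masses):
        cumulative += mass
        if cumulative / total >= q:
            return (i + 1) / len(x)
    return 1.0  # only reached through rounding when q is close to 1


centre_uniform = index_mass_quantile_sketch([1, 1, 1, 1], q=0.5)
centre_late = index_mass_quantile_sketch([0, 0, 0, 10], q=0.5)
print(centre_uniform, centre_late)
```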
-
tsfresh.feature_extraction.feature_calculators.
kurtosis
(x, *arg, **args)[source]¶ Returns the kurtosis of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G2).
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
large_number_of_peaks
(x, *arg, **args)[source]¶ Checks if the number of peaks is higher than n.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- n (int) – the number of peaks to compare
Returns: the value of this feature
Return type: bool
This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
large_standard_deviation
(x, *arg, **args)[source]¶ Boolean variable denoting if the standard deviation of x is higher than r times the range, calculated as the difference between max and min of x. Hence it checks if
$std(x) > r \cdot (\max(x) - \min(x))$
According to a rule of thumb, the standard deviation should be a fourth of the range of the values.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- r (float) – the percentage of the range to compare with
Returns: the value of this feature
Return type: bool
This function is of type: aggregate_with_parameters
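A sketch of this check in plain Python, assuming it compares the standard deviation against r times the range (max minus min), with r the parameter listed above. Using the population standard deviation is a further assumption, and the name `large_standard_deviation_sketch` is hypothetical.

```python
import statistics


def large_standard_deviation_sketch(x, r):
    """True if the standard deviation of x exceeds r times the range of x."""
    value_range = max(x) - min(x)
    return statistics.pstdev(x) > r * value_range


series = [0, 10, 0, 10]  # population standard deviation 5, range 10
is_large_quarter = large_standard_deviation_sketch(series, r=0.25)
is_large_wide = large_standard_deviation_sketch(series, r=0.6)
print(is_large_quarter, is_large_wide)
```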
-
tsfresh.feature_extraction.feature_calculators.
last_location_of_maximum
(x, *arg, **args)[source]¶ Returns the relative last location of the maximum value of x. The position is calculated relatively to the length of x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
last_location_of_minimum
(x, *arg, **args)[source]¶ Returns the last location of the minimal value of x. The position is calculated relatively to the length of x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
length
(x, *arg, **args)[source]¶ Returns the length of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: int
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
longest_strike_above_mean
(x, *arg, **args)[source]¶ Returns the length of the longest consecutive subsequence in x that is bigger than the mean of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
longest_strike_below_mean
(x, *arg, **args)[source]¶ Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
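The two "longest strike" features amount to finding the longest run of consecutive values on one side of the mean. A plain-Python sketch, with a hypothetical helper `longest_strike_sketch` and strict comparisons as an assumption:

```python
def longest_strike_sketch(x, above=True):
    """Length of the longest run of consecutive values above (or below) the mean."""
    mean = sum(x) / len(x)
    longest = current = 0
    for value in x:
        on_side = value > mean if above else value < mean
        current = current + 1 if on_side else 0  # a miss resets the run
        longest = max(longest, current)
    return longest


series = [1, 5, 6, 1, 7, 8, 9, 1]  # mean is 4.75
strike_above = longest_strike_sketch(series, above=True)   # run 7, 8, 9
strike_below = longest_strike_sketch(series, above=False)  # isolated 1s
print(strike_above, strike_below)
```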
-
tsfresh.feature_extraction.feature_calculators.
maximum
(x)[source]¶ Calculates the highest value of the time series x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
mean
(x)[source]¶ Returns the mean of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
mean_abs_change
(x, *arg, **args)[source]¶ Returns the mean over the absolute differences between subsequent time series values which is
$\frac{1}{n-1} \sum_{i=1}^{n-1} | x_{i+1} - x_i |$
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
mean_abs_change_quantiles
(x, *arg, **args)[source]¶ First fixes a corridor given by the quantiles ql and qh of the distribution of x. Then calculates the average absolute value of consecutive changes of the series x inside this corridor. Think about selecting a corridor on the y-Axis and only calculating the mean of the absolute change of the time series inside this corridor.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- ql (float) – the lower quantile of the corridor
- qh (float) – the higher quantile of the corridor
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
mean_autocorrelation
(x, *arg, **args)[source]¶ Calculates the average autocorrelation (compare to http://en.wikipedia.org/wiki/Autocorrelation#Estimation), taken over all possible lags $l$ (1 to length of x), of the estimate
$R(l) = \frac{1}{(n - l)\sigma^2} \sum_{t=1}^{n-l} (X_t - \mu)(X_{t+l} - \mu)$
where $n$ is the length of the time series $X_i$, $\sigma^2$ its variance and $\mu$ its mean.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
mean_change
(x, *arg, **args)[source]¶ Returns the mean over the differences between subsequent time series values which is
$\frac{1}{n-1} \sum_{i=1}^{n-1} (x_{i+1} - x_i)$
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
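The difference between mean_abs_change and mean_change can be sketched in plain Python. The sketch assumes that mean_change averages the signed differences, as its name suggests, while mean_abs_change averages their absolute values; both function names below are hypothetical.

```python
def mean_abs_change_sketch(x):
    """Mean of |x[i+1] - x[i]| over all consecutive pairs."""
    differences = [after - before for before, after in zip(x, x[1:])]
    return sum(abs(d) for d in differences) / len(differences)


def mean_change_sketch(x):
    """Mean of the signed differences x[i+1] - x[i]."""
    differences = [after - before for before, after in zip(x, x[1:])]
    return sum(differences) / len(differences)


series = [1, 4, 2, 2]  # differences: 3, -2, 0
abs_change = mean_abs_change_sketch(series)
signed_change = mean_change_sketch(series)
print(abs_change, signed_change)
```

Because the signed differences telescope, the signed mean equals (last value - first value) / (n - 1), so it only reflects the overall drift of the series.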
-
tsfresh.feature_extraction.feature_calculators.
mean_second_derivate_central
(x, *arg, **args)[source]¶ Returns the mean value of a central approximation of the second derivative
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
median
(x)[source]¶ Returns the median of x
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
minimum
(x)[source]¶ Calculates the lowest value of the time series x.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
not_apply_to_raw_numbers
(func)[source]¶ This decorator makes sure that the function func is only called on objects that are not numbers.Number
Parameters: func – the method that should only be executed on objects which are not a numbers.Number Returns: the decorated version of func which returns 0 if the first argument x is a numbers.Number. For every other x the output of func is returned
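The behaviour described above can be sketched with a small decorator. This is an illustrative reimplementation under the stated behaviour (return 0 when the first argument is a numbers.Number, otherwise call func); the names `not_apply_to_raw_numbers_sketch` and `maximum_sketch` are hypothetical.

```python
import functools
import numbers


def not_apply_to_raw_numbers_sketch(func):
    """Wrap func so that it returns 0 when called on a bare number."""

    @functools.wraps(func)
    def wrapper(x, *args, **kwargs):
        if isinstance(x, numbers.Number):
            return 0
        return func(x, *args, **kwargs)

    return wrapper


@not_apply_to_raw_numbers_sketch
def maximum_sketch(x):
    return max(x)


on_series = maximum_sketch([1, 7, 3])  # func is called as usual
on_number = maximum_sketch(42)         # short-circuits to 0
print(on_series, on_number)
```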
-
tsfresh.feature_extraction.feature_calculators.
number_cwt_peaks
(x, *arg, **args)[source]¶ This feature calculator searches for different peaks in x. To do so, x is smoothed by a Ricker wavelet for widths ranging from 1 to n. This feature calculator returns the number of peaks that occur at enough width scales and with sufficiently high Signal-to-Noise-Ratio (SNR)
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- n (int) – maximum width to consider
Returns: the value of this feature
Return type: int
This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
number_peaks
(x, *arg, **args)[source]¶ Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs, which is bigger than its n neighbours to the left and to the right.
Hence in the sequence
>>> x = [3, 0, 0, 4, 0, 0, 13]
4 is a peak of support 1 and 2 because in the subsequences
>>> [0, 4, 0]
>>> [0, 0, 4, 0, 0]
4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the 3rd neighbour to the right of 4 and it is bigger than 4.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- n (int) – the support of the peak
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
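The peak definition above can be sketched in plain Python. The sketch assumes that only positions with n neighbours on both sides are candidates, so the trailing 13 in the example sequence is never counted; the name `number_peaks_sketch` is hypothetical.

```python
def number_peaks_sketch(x, n):
    """Count values that are bigger than their n neighbours on each side."""
    count = 0
    # Only positions with n neighbours to the left and right are candidates.
    for i in range(n, len(x) - n):
        left = x[i - n:i]
        right = x[i + 1:i + n + 1]
        if all(x[i] > v for v in left) and all(x[i] > v for v in right):
            count += 1
    return count


x = [3, 0, 0, 4, 0, 0, 13]
peaks_support_1 = number_peaks_sketch(x, 1)
peaks_support_2 = number_peaks_sketch(x, 2)
peaks_support_3 = number_peaks_sketch(x, 3)
print(peaks_support_1, peaks_support_2, peaks_support_3)
```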
-
tsfresh.feature_extraction.feature_calculators.
percentage_of_reoccurring_datapoints_to_all_datapoints
(x, *arg, **args)[source]¶ Returns the percentage of unique values that are present in the time series more than once.
len(different values occurring more than once) / len(different values)
This means the percentage is normalized to the number of unique values, in contrast to the percentage_of_reoccurring_values_to_all_values.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
percentage_of_reoccurring_values_to_all_values
(x, *arg, **args)[source]¶ Returns the ratio of unique values that are present in the time series more than once.
# of data points occurring more than once / # of all data points
This means the ratio is normalized to the number of data points in the time series, in contrast to the percentage_of_reoccurring_datapoints_to_all_datapoints.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
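The two normalizations can be contrasted in a short plain-Python sketch. Counting every occurrence of a repeated value as a "data point occurring more than once" is an assumption about the second ratio, and the helper name `reoccurring_ratios_sketch` is hypothetical.

```python
from collections import Counter


def reoccurring_ratios_sketch(x):
    """Return (ratio over distinct values, ratio over all data points)."""
    counts = Counter(x)
    repeated_values = [value for value, count in counts.items() if count > 1]
    # Normalized to the number of distinct values.
    ratio_over_unique = len(repeated_values) / len(counts)
    # Normalized to the number of data points (every repeat is counted).
    repeated_points = sum(count for count in counts.values() if count > 1)
    ratio_over_points = repeated_points / len(x)
    return ratio_over_unique, ratio_over_points


ratio_unique, ratio_points = reoccurring_ratios_sketch([1, 1, 2, 3, 3, 3])
print(ratio_unique, ratio_points)
```

In the example, two of the three distinct values repeat, while five of the six data points belong to a repeated value.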
-
tsfresh.feature_extraction.feature_calculators.
quantile
(x, *arg, **args)[source]¶ Calculates the q quantile of x. This is the value of x that is greater than q% of the ordered values from x.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- q (float) – the quantile to calculate
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
range_count
(x, min, max)[source]¶ Count observed values within the interval [min, max).
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- min (int or float) – the inclusive lower bound of the range
- max (int or float) – the exclusive upper bound of the range
Returns: the count of values within the range
Return type: This function is of type: aggregate_with_parameters
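Counting values inside the half-open interval [min, max) can be sketched as follows. The bounds are renamed to `lower` and `upper` in this hypothetical helper to avoid shadowing Python's built-in min and max.

```python
def range_count_sketch(x, lower, upper):
    """Count values v of x with lower <= v < upper."""
    return sum(1 for value in x if lower <= value < upper)


in_range = range_count_sketch([1, 2, 3, 4, 5], lower=2, upper=4)  # counts 2 and 3
print(in_range)
```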
-
tsfresh.feature_extraction.feature_calculators.
ratio_value_number_to_time_series_length
(x, *arg, **args)[source]¶ Returns a factor which is 1 if all values in the time series occur only once, and below one if this is not the case. In principle, it just returns
# unique values / # values
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
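The ratio of unique values to all values can be sketched in one line of plain Python; the name `ratio_value_number_sketch` is hypothetical.

```python
def ratio_value_number_sketch(x):
    """Number of distinct values of x divided by the length of x."""
    return len(set(x)) / len(x)


all_unique = ratio_value_number_sketch([1, 2, 3])
with_repeats = ratio_value_number_sketch([1, 1, 2, 3])
print(all_unique, with_repeats)
```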
-
tsfresh.feature_extraction.feature_calculators.
sample_entropy
(x)[source]¶ Calculate and return sample entropy of x.
References
[1] http://en.wikipedia.org/wiki/Sample_Entropy
[2] https://www.ncbi.nlm.nih.gov/pubmed/10843903?dopt=Abstract
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- tolerance (float) – normalization factor; equivalent to the common practice of expressing the tolerance as r times the standard deviation
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
set_property
(key, value)[source]¶ This method returns a decorator that sets the property key of the function to value
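A decorator factory of this kind can be sketched as follows. The property name "fctype" and the function names are hypothetical and chosen only for illustration.

```python
def set_property_sketch(key, value):
    """Return a decorator that sets the attribute key of a function to value."""

    def decorate(func):
        setattr(func, key, value)
        return func

    return decorate


@set_property_sketch("fctype", "aggregate")
def length_sketch(x):
    return len(x)


print(length_sketch.fctype, length_sketch([1, 2, 3]))
```

Attaching metadata this way leaves the decorated function itself unchanged, so it can still be called directly.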
-
tsfresh.feature_extraction.feature_calculators.
skewness
(x, *arg, **args)[source]¶ Returns the sample skewness of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G1).
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the value of this feature
Return type: float
This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
spkt_welch_density
(x, *arg, **args)[source]¶ This feature calculator estimates the cross power spectral density of the time series x at different frequencies. To do so, first the time series is shifted from the time domain to the frequency domain.
The feature calculator returns the power spectrum of the different frequencies.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- c (str) – the time series name
- param (list) – contains dictionaries {“coeff”: x} with x int
Returns: the different feature values
Return type: pandas.Series
This function is of type: apply
-
tsfresh.feature_extraction.feature_calculators.
standard_deviation
(x, *arg, **args)[source]¶ Returns the standard deviation of x
Parameters: x (pandas.Series) – the time series to calculate the feature of Returns: the value of this feature Return type: float This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
sum_of_reoccurring_values
(x, *arg, **args)[source]¶ Returns the sum of all values, that are present in the time series more than once.
Parameters: x (pandas.Series) – the time series to calculate the feature of Returns: the value of this feature Return type: float This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
sum_values
(x)[source]¶ Calculates the sum over the time series values
Parameters: x (pandas.Series) – the time series to calculate the feature of Returns: the value of this feature Return type: float This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
symmetry_looking
(x, *arg, **args)[source]¶ Boolean variable denoting if the distribution of x looks symmetric. This is the case if
$| \text{mean}(X) - \text{median}(X) | < r \cdot (\max(X) - \min(X))$
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- r (float) – the percentage of the range to compare with
Returns: the value of this feature
Return type: bool
This function is of type: aggregate_with_parameters
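A minimal sketch of such a symmetry check, using only the standard library. It assumes that "looks symmetric" means the distance between mean and median is smaller than r times the range (max minus min) of the series; that criterion and the function body are assumptions made for illustration, not the library's code.

```python
import statistics


def symmetry_looking(x, r):
    """True if |mean - median| is below r times the range of x (assumed criterion)."""
    values = list(x)
    value_range = max(values) - min(values)
    mean_median_gap = abs(statistics.mean(values) - statistics.median(values))
    return mean_median_gap < r * value_range


print(symmetry_looking([1, 2, 3, 4, 5], r=0.1))   # mean and median coincide
print(symmetry_looking([1, 1, 1, 1, 10], r=0.1))  # one outlier pulls the mean away
```

A larger r makes the check more tolerant, because a bigger share of the range is allowed between mean and median.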
-
tsfresh.feature_extraction.feature_calculators.
time_reversal_asymmetry_statistic
(x, *arg, **args)[source]¶ This function calculates the value of
$\frac{1}{n - 2 \cdot lag} \sum_{i=1}^{n - 2 \cdot lag} \left( x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} - x_{i + lag} \cdot x_{i}^2 \right)$
which is
$\mathbb{E}\left[ L^2(X)^2 \cdot L(X) - L(X) \cdot X^2 \right]$
where $\mathbb{E}$ is the mean and $L$ is the lag operator. It was proposed in [1] as a promising feature to extract from time series.
References
[1] Fulcher, B.D., Jones, N.S. (2014). Highly comparative feature-based time-series classification. Knowledge and Data Engineering, IEEE Transactions on 26, 3026–3037.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- lag (int) – the lag that should be used in the calculation of the feature
Returns: the value of this feature
Return type: float
This function is of type: aggregate_with_parameters
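The statistic can be sketched in plain Python as follows. The sketch assumes the statistic is the mean of x[i + 2*lag]**2 * x[i + lag] - x[i + lag] * x[i]**2 over all valid positions i, and that a series too short for the given lag yields 0; both points are assumptions for illustration rather than confirmed library behaviour.

```python
def time_reversal_asymmetry_statistic(x, lag):
    """Mean of x[i+2*lag]**2 * x[i+lag] - x[i+lag] * x[i]**2 over valid i."""
    values = list(x)
    n_terms = len(values) - 2 * lag
    if n_terms <= 0:
        return 0.0  # assumption: series too short for this lag
    total = 0.0
    for i in range(n_terms):
        ahead_twice = values[i + 2 * lag]
        ahead_once = values[i + lag]
        total += ahead_twice ** 2 * ahead_once - ahead_once * values[i] ** 2
    return total / n_terms


print(time_reversal_asymmetry_statistic([1, 2, 3, 4], lag=1))
```

For a constant series every term cancels, so the statistic is 0; a steadily rising series such as the one above produces a positive value.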
-
tsfresh.feature_extraction.feature_calculators.
value_count
(x, value)[source]¶ Count occurrences of value in time series x.
Parameters: - x (pandas.Series) – the time series to calculate the feature of
- value (int or float) – the value to be counted
Returns: the count
Return type: This function is of type: aggregate_with_parameters
-
tsfresh.feature_extraction.feature_calculators.
variance
(x, *arg, **args)[source]¶ Returns the variance of x
Parameters: x (pandas.Series) – the time series to calculate the feature of Returns: the value of this feature Return type: float This function is of type: aggregate
-
tsfresh.feature_extraction.feature_calculators.
variance_larger_than_standard_deviation
(x, *arg, **args)[source]¶ Boolean variable denoting if the variance of x is greater than its standard deviation. This is equivalent to the variance of x being larger than 1.
Parameters: x (pandas.Series) – the time series to calculate the feature of Returns: the value of this feature Return type: bool This function is of type: aggregate
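The equivalence stated above follows because the standard deviation is the square root of the variance, and a non-negative number v exceeds its own square root exactly when v > 1. A small sketch, assuming the population variance is meant (which convention the library uses is not stated here):

```python
import statistics


def variance_larger_than_standard_deviation(x):
    """True if the (population) variance of x exceeds its standard deviation."""
    variance = statistics.pvariance(x)
    standard_deviation = variance ** 0.5
    return variance > standard_deviation


for series in ([0, 4], [0, 2], [0, 1]):
    # both columns agree: variance > std is the same test as variance > 1
    print(series, variance_larger_than_standard_deviation(series), statistics.pvariance(series) > 1)
```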
This module contains all settings of tsfresh. For the naming of the features, see Feature Calculation.
-
class
tsfresh.feature_extraction.settings.
FeatureExtractionSettings
(calculate_all_features=True)[source]¶ Bases:
future.types.newobject.newobject
This class defines the behaviour of feature extraction, in particular which feature and parameter combinations are calculated. If you do not specify any user settings, all features will be extracted with default arguments defined in this class.
In general, we consider three types of time series features:
- aggregate features without parameter that emit exactly one feature per function calculator
- aggregate features with parameter that emit exactly one feature per function calculator
- apply features with parameters that emit several features per function calculator (usually one feature per parameter value)
These three types are stored in different dictionaries. For the feature types with parameters there are also dictionaries containing the parameters.
It is possible to obtain a FeatureExtractionSettings object from a feature matrix, see from_columns(). This is useful to reproduce the features of a train set for a test set.
To set user defined settings, do something like
>>> from tsfresh.feature_extraction import FeatureExtractionSettings
>>> settings = FeatureExtractionSettings()
>>> # Calculate all features except length
>>> settings.do_not_calculate("length")
>>> from tsfresh.feature_extraction import extract_features
>>> extract_features(df, feature_extraction_settings=settings)
Mostly, the settings in this class are for enabling/disabling the extraction of certain features, which can be important to save time during feature extraction. Additionally, some of the features have parameters which can be controlled here.
If the calculation of a feature failed (for whatever reason), the results can be NaN. The IMPUTE flag defaults to None and can be set to one of the impute functions in dataframe_functions.
-
do_not_calculate
(kind, identifier)[source]¶ Delete all features of type identifier for time series of type kind.
Parameters: - kind (basestring) – the type of the time series
- identifier (basestring) – the name of the feature
Returns: The setting object itself
Return type:
-
static
from_columns
(columns)[source]¶ Creates a FeatureExtractionSettings object set to extract only the features contained in the list columns. To do so, for every feature name in columns this method will
- split the column name into col, feature, params part
- decide which feature we are dealing with (aggregate with/without params or apply)
- add it to the new name_to_function dict
- set up the params
Set the feature and params dictionaries in the settings object, then return it.
Parameters: columns (list of str) – containing the feature names Returns: The changed settings object Return type: FeatureExtractionSettings
-
get_aggregate_functions
(kind)[source]¶ For the time series of type kind, returns a dictionary with the column name mapped to the feature calculators that are specified in the FeatureExtractionSettings object. This dictionary can be used in a pandas group by command to extract all aggregate features at the same time.
Parameters: kind (basestring) – the type of the time series Returns: mapping of column name to function calculator Return type: dict
-
get_apply_functions
(column_prefix)[source]¶ Convenience function to return a list with all the functions to apply on a data frame and extract features. Only adds those functions to the list that are enabled in the settings.
Parameters: column_prefix (basestring) – the prefix of all column names. Returns: all functions to use for feature extraction Return type: list
-
static
get_config_from_string
(parts)[source]¶ Helper function to extract the configuration of a certain function from the column name. The column name parts (split by “__”) should be passed to this function. It will skip the kind name and the function name and only use the parameter parts. These parts will be split up on “_” into the parameter name and the parameter value. This value is transformed into a python object (for example, “(1, 2, 3)” is transformed into a tuple consisting of the ints 1, 2 and 3).
Parameters: parts (list) – The column name split up on “__” Returns: a dictionary with all parameters, which are encoded in the column name. Return type: dict
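The parsing described above could look roughly like the following sketch. It is not the library's code: the helper name parse_parameters and the feature name example_feature are hypothetical, and splitting each part at its last “_” (so that parameter names may themselves contain underscores) is an assumption.

```python
import ast


def parse_parameters(column_name):
    """Return the parameters encoded in a '<kind>__<feature>__<name>_<value>...' column name."""
    parts = column_name.split("__")
    parameters = {}
    # parts[0] is the kind of time series, parts[1] the feature name: skip both
    for part in parts[2:]:
        # assumption: the value follows the last underscore of the part
        name, raw_value = part.rsplit("_", 1)
        # literal_eval turns "2" into an int and "(1, 2, 3)" into a tuple
        parameters[name] = ast.literal_eval(raw_value)
    return parameters


print(parse_parameters("temperature__example_feature__lag_2__coeffs_(1, 2, 3)"))
```

Using ast.literal_eval rather than eval keeps the conversion restricted to Python literals, which matters because column names are plain user-supplied strings.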
-
class
tsfresh.feature_extraction.settings.
MinimalFeatureExtractionSettings
[source]¶ Bases:
tsfresh.feature_extraction.settings.FeatureExtractionSettings
This class is a subclass of the FeatureExtractionSettings class and has the same functionality as its base class. The only difference is that most of the feature calculators are disabled and only a small subset of calculators will be calculated at all.
Use this class for quick tests of your setup before calculating all features, which could take some time depending on the size of your data set.
You should use this object when calling the extract function, like so:
>>> from tsfresh.feature_extraction import extract_features, MinimalFeatureExtractionSettings
>>> extract_features(df, feature_extraction_settings=MinimalFeatureExtractionSettings())
The tsfresh.feature_extraction module contains methods to extract the features from the time series.
tsfresh.feature_selection package¶
Contains a feature selection method that evaluates the importance of the different extracted features. To do so, for every feature the influence on the target is evaluated by a univariate test and the p-value is calculated. The methods that calculate the p-values are called feature selectors.
Afterwards the Benjamini Hochberg procedure, which is a multiple testing procedure, decides which features to keep and which to cut off (solely based on the p-values).
-
tsfresh.feature_selection.feature_selector.
benjamini_hochberg_test
(df_pvalues, settings)[source]¶ This is an implementation of the Benjamini Hochberg procedure that calculates which of the hypotheses belonging to the different p-values from df_pvalues to reject. While doing so, this test controls the false discovery rate, which is the expected ratio of false rejections to all rejections:
$FDR = \mathbb{E}\left[ \frac{|\text{false rejections}|}{|\text{all rejections}|} \right]$
References
[1] Benjamini, Yoav and Yekutieli, Daniel (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165–1188
Parameters: - df_pvalues (pandas.DataFrame) – This DataFrame should contain the p_values of the different hypotheses in a column named “p_values”.
- settings (FeatureSignificanceTestsSettings) – The settings object to use for controlling the false discovery rate (FDR_level) and whether to treat the hypotheses as independent or not (hypotheses_independent).
Returns: The same DataFrame as the input, but with an added boolean column “rejected”.
Return type:
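To make the procedure concrete, here is a minimal plain-Python sketch of the step-up rule. It works on a list of p-values instead of a DataFrame, the function name and arguments are hypothetical, and the correction factor for dependent hypotheses follows the Benjamini and Yekutieli paper cited above; it should be read as an illustration under these assumptions, not as the library's implementation.

```python
def benjamini_hochberg_rejections(p_values, fdr_level=0.05, independent=False):
    """Return one boolean per p-value: True if its hypothesis is rejected."""
    m = len(p_values)
    if m == 0:
        return []
    # Under dependency the thresholds are divided by 1 + 1/2 + ... + 1/m
    correction = 1.0 if independent else sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k whose p-value is below k * q / (m * c)
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * fdr_level / (m * correction):
            largest_rank = rank
    rejected = [False] * m
    # Every hypothesis up to that rank is rejected, even if its own threshold failed
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


print(benjamini_hochberg_rejections([0.001, 0.03, 0.035, 0.9], independent=True))
print(benjamini_hochberg_rejections([0.001, 0.03, 0.035, 0.9], independent=False))
```

In the first call the p-value 0.03 misses its own threshold but is still rejected, because a larger p-value further down the ranking passes; the second call shows how the dependency correction makes the selection more conservative.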
-
tsfresh.feature_selection.feature_selector.
check_fs_sig_bh
(X, y, settings=None)[source]¶ The wrapper function that calls the significance test functions in this package. In total, for each feature from the input pandas.DataFrame a univariate feature significance test is conducted. Those tests generate p-values that are then evaluated by the Benjamini Hochberg procedure to decide which features to keep and which to delete.
We are testing
$H_0$ = the Feature is not relevant and can not be added
against
$H_1$ = the Feature is relevant and should be kept
or in other words
$H_0$ = Target and Feature are independent / the Feature has no influence on the target
$H_1$ = Target and Feature are associated / dependent
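When the target is binary this becomes
$H_0 = \left( F_{\text{target}=1} = F_{\text{target}=0} \right)$
$H_1 = \left( F_{\text{target}=1} \neq F_{\text{target}=0} \right)$
where $F$ is the distribution of the feature.
In the same way we can state the hypothesis when the feature is binary
$H_0 = \left( T_{\text{feature}=1} = T_{\text{feature}=0} \right)$
$H_1 = \left( T_{\text{feature}=1} \neq T_{\text{feature}=0} \right)$
Here $T$ is the distribution of the target.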
TODO: And for real valued?
Parameters: - X (pandas.DataFrame) – The DataFrame containing all the features and the target
- y (pandas.Series) – The target vector
- settings (FeatureSignificanceTestsSettings) – The feature selection settings to use for performing the tests.
Returns: A pandas.DataFrame with each column of the input DataFrame X as index with information on the significance of this particular feature. The DataFrame has the columns “Feature”, “type” (binary, real or const), “p_value” (the significance of this feature as a p-value, lower means more significant) and “rejected” (if the Benjamini Hochberg procedure rejected this feature).
Return type:
This module contains the filtering process for the extracted features. The filtering procedure can also be used on other features that are not based on time series.
-
tsfresh.feature_selection.selection.
select_features
(X, y, feature_selection_settings=None)[source]¶ Check the significance of all features (columns) of feature matrix X and return a possibly reduced feature matrix only containing relevant features.
The feature matrix must be a pandas.DataFrame in the format:
index  feature_1  feature_2  ...  feature_N
A      ...        ...        ...  ...
B      ...        ...        ...  ...
...    ...        ...        ...  ...
...    ...        ...        ...  ...
...    ...        ...        ...  ...
Each column will be handled as a feature and tested for its significance to the target.
The target vector must be a pandas.Series or numpy.array in the form
index  target
A      ...
B      ...
.      ...
.      ...
and must contain all ids that are in the feature matrix. If y is a numpy.array without index, it is assumed that y has the same order and length as X and the rows correspond to each other.
Examples
>>> from tsfresh.examples import load_robot_execution_failures
>>> from tsfresh import extract_features, select_features
>>> df, y = load_robot_execution_failures()
>>> X_extracted = extract_features(df, column_id='id', column_sort='time')
>>> X_selected = select_features(X_extracted, y)
Parameters: - X (pandas.DataFrame) – Feature matrix in the format mentioned before which will be reduced to only the relevant features. It can contain both binary and real-valued features at the same time.
- y (pandas.Series or numpy.ndarray) – Target vector which is needed to test which features are relevant. Can be binary or real-valued.
- feature_selection_settings (FeatureSignificanceTestsSettings) – The settings to control the feature selection algorithms. See FeatureSignificanceTestsSettings for more information. If none is passed, the defaults will be used.
Returns: The same DataFrame as X, but possibly with reduced number of columns ( = features).
Return type:
Raises: ValueError when the target vector does not fit the feature matrix.
-
class
tsfresh.feature_selection.settings.
FeatureSignificanceTestsSettings
[source]¶ Bases:
future.types.newobject.newobject
The settings object for controlling the feature significance tests. Normally, you do not have to handle these settings on your own, as the chosen defaults are quite sensible.
This object is passed to almost all functions in the feature_selection submodules.
If you want non-default settings, create a new settings object and pass it to the functions, for example if you want a less conservative selection of features you could increase the fdr level to 10%.
>>> from tsfresh.feature_selection import FeatureSignificanceTestsSettings
>>> settings = FeatureSignificanceTestsSettings()
>>> settings.fdr_level = 0.1
>>> from tsfresh.feature_selection import select_features
>>> select_features(X, y, feature_selection_settings=settings)
This selection process will return more features as the fdr level was raised.
-
fdr_level
= None¶ The FDR level that should be respected. This is the theoretical expected percentage of irrelevant features among all selected features. E.g.
-
hypotheses_independent
= None¶ Can the significance of the features be assumed to be independent? Normally, this should be set to False as the features are never independent (think about mean and median)
-
n_processes
= None¶ Number of processes to use during the p-value calculation
-
result_dir
= None¶ Where to store the selection report
-
test_for_binary_target_binary_feature
= None¶ Which test to be used for binary target, binary feature (unused)
-
test_for_binary_target_real_feature
= None¶ Which test to be used for binary target, real feature
-
test_for_real_target_binary_feature
= None¶ Which test to be used for real target, binary feature (unused)
-
test_for_real_target_real_feature
= None¶ Which test to be used for real target, real feature (unused)
-
write_selection_report
= None¶ Whether to store the selection report after the Benjamini Hochberg procedure has finished.
-
Contains the methods from the following paper about FRESH [2].
FRESH is based on hypothesis tests that individually check the significance of every generated feature on the target. It makes sure that only features that are relevant for the regression or classification task at hand are kept. FRESH distinguishes four settings, depending on whether the features and the target are binary or not.
The four functions are named
- target_binary_feature_binary_test(): Target and feature are both binary
- target_binary_feature_real_test(): Target is binary and feature real
- target_real_feature_binary_test(): Target is real and the feature is binary
- target_real_feature_real_test(): Target and feature are both real
References
[2] Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-prints: 1610.07717 https://arxiv.org/abs/1610.07717
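The choice between the four test functions can be pictured with the small sketch below. It only returns the name of the matching function; the helper names are hypothetical, and treating a vector as binary when it holds exactly two distinct values is an assumption, since how the package decides this is not described here.

```python
def is_binary(values):
    """Assumption: a vector counts as binary if it has exactly two distinct values."""
    return len(set(values)) == 2


def choose_significance_test(feature, target):
    """Return the name of the test function matching the feature and target types."""
    target_kind = "binary" if is_binary(target) else "real"
    feature_kind = "binary" if is_binary(feature) else "real"
    return "target_{}_feature_{}_test".format(target_kind, feature_kind)


print(choose_significance_test(feature=[0.1, 0.5, 0.7, 0.2], target=[1, 1, 0, 0]))
```

A constant feature (a single distinct value) is not handled separately in this sketch and would be classified as real.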
-
tsfresh.feature_selection.significance_tests.
target_binary_feature_binary_test
(x, y, settings=None)[source]¶ Calculate the feature significance of a binary feature to a binary target as a p-value. Use the two-sided univariate Fisher test from fisher_exact() for this.
Parameters: - x (pandas.Series) – the binary feature vector
- y (pandas.Series) – the binary target vector
- settings (FeatureSignificanceTestsSettings or None) – The settings object to control how the significance is calculated (currently unused).
Returns: the p-value of the feature significance test. Lower p-values indicate a higher feature significance
Return type:
Raises: ValueError if the target or the feature is not binary.
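To show what a two-sided Fisher test computes for a binary feature and a binary target, here is a standard-library sketch that builds the 2x2 contingency table and sums the hypergeometric probabilities of all tables that are at most as likely as the observed one. It assumes both vectors are encoded as 0/1 and is not a replacement for fisher_exact(); the function name and the small tolerance for floating point ties are choices made for this illustration.

```python
from math import comb


def fisher_exact_two_sided(feature, target):
    """Two-sided Fisher exact p-value for two equally long 0/1 vectors."""
    pairs = list(zip(feature, target))
    both_one = sum(1 for f, t in pairs if f == 1 and t == 1)
    feature_one = sum(1 for f, _ in pairs if f == 1)   # first row total
    feature_zero = len(pairs) - feature_one            # second row total
    target_one = sum(1 for _, t in pairs if t == 1)    # first column total

    def table_probability(k):
        # Hypergeometric probability of k (feature=1, target=1) cases, margins fixed
        return comb(feature_one, k) * comb(feature_zero, target_one - k) / comb(len(pairs), target_one)

    observed = table_probability(both_one)
    smallest = max(0, target_one - feature_zero)
    largest = min(feature_one, target_one)
    # Sum over all tables that are no more likely than the observed one
    p_value = sum(
        table_probability(k)
        for k in range(smallest, largest + 1)
        if table_probability(k) <= observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


feature = [1] * 10 + [0] * 6
target = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 5
print(fisher_exact_two_sided(feature, target))
```

A small p-value indicates that the feature and the target are unlikely to be independent, which is the sense in which lower p-values mean higher feature significance.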
-
tsfresh.feature_selection.significance_tests.
target_binary_feature_real_test
(x, y, settings)[source]¶ Calculate the feature significance of a real-valued feature to a binary target as a p-value. Use either the Mann-Whitney U test or the Kolmogorov-Smirnov test, from mannwhitneyu() or ks_2samp() respectively, for this.
Parameters: - x (pandas.Series) – the real-valued feature vector
- y (pandas.Series) – the binary target vector
- settings (FeatureSignificanceTestsSettings) – The settings object to control how the significance is calculated.
Returns: the p-value of the feature significance test. Lower p-values indicate a higher feature significance
Return type:
Raises: ValueError if the target is not binary.
-
tsfresh.feature_selection.significance_tests.
target_real_feature_binary_test
(x, y, settings=None)[source]¶ Calculate the feature significance of a binary feature to a real-valued target as a p-value. Use the Kolmogorov-Smirnov test from ks_2samp() for this.
Parameters: - x (pandas.Series) – the binary feature vector
- y (pandas.Series) – the real-valued target vector
- settings (FeatureSignificanceTestsSettings or None) – The settings object to control how the significance is calculated (currently unused).
Returns: the p-value of the feature significance test. Lower p-values indicate a higher feature significance.
Return type:
Raises: ValueError if the feature is not binary.
-
tsfresh.feature_selection.significance_tests.
target_real_feature_real_test
(x, y, settings=None)[source]¶ Calculate the feature significance of a real-valued feature to a real-valued target as a p-value. Use Kendall’s tau from kendalltau() for this.
Parameters: - x (pandas.Series) – the real-valued feature vector
- y (pandas.Series) – the real-valued target vector
- settings (FeatureSignificanceTestsSettings or None) – The settings object to control how the significance is calculated (currently unused).
Returns: the p-value of the feature significance test. Lower p-values indicate a higher feature significance.
Return type:
The feature_selection module contains feature selection algorithms.
These methods are suited to pick the best explaining features out of a massive amount of features.
Often the features have to be picked in situations where one has more features than samples.
Traditional feature selection methods may not be suitable for such situations, which is why we propose a p-value based
approach that inspects the significance of the features individually to avoid overfitting and spurious correlations.
tsfresh.scripts package¶
Run the script with:
```
python run_tsfresh.py path_to_your_csv.csv
```
- Currently this only samples the first 50 values.
- Your csv must be space delimited.
- Output is saved as path_to_your_csv.features.csv

e.g.:
```
python run_tsfresh.py data.txt
```
A corresponding csv containing time series features will be saved as features_path_to_your_csv.csv
tsfresh.transformers package¶
-
class
tsfresh.transformers.feature_augmenter.
FeatureAugmenter
(settings=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None)[source]¶ Bases:
sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Sklearn-compatible estimator, for calculating and adding many features calculated from a given time series to the data. It is basically a wrapper around extract_features().
The features include basic ones like min, max or median, and advanced features like Fourier transformations or statistical tests. For a list of all possible features, see the module feature_calculators. The column name of each added feature contains the name of the function of that module, which was used for the calculation.
For this estimator, two datasets play a crucial role:
- the time series container with the timeseries data. This container (for the format see Data Formats) contains the data which is used for calculating the features. It must be groupable by ids which are used to identify which feature should be attached to which row in the second dataframe:
- the input data, where the features will be added to.
Imagine the following situation: You want to classify 10 different financial shares and you have their development in the last year as a time series. You would then start by creating features from the metainformation of the shares, e.g. how long they were on the market etc. and filling up a table - the features of one stock in one row.
>>> import pandas
>>> df = pandas.DataFrame()
>>> # Fill in the information of the stocks
>>> df["started_since_days"] = 0 # add a feature
You can then extract all the features from the time development of the shares, by using this estimator:
>>> time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import FeatureAugmenter
>>> augmenter = FeatureAugmenter()
>>> augmenter.set_timeseries_container(time_series)
>>> df_with_time_series_features = augmenter.transform(df)
The settings for the feature calculation can be controlled with the settings object. If you pass None, the default settings are used. Please refer to FeatureExtractionSettings for more information.
This estimator does not select the relevant features, but calculates and adds all of them to the DataFrame. See the RelevantFeatureAugmenter for calculating and selecting features.
For a description of what the parameters column_id, column_sort, column_kind and column_value mean, please see extraction.
-
fit
(X=None, y=None)[source]¶ The fit function is not needed for this estimator. It just does nothing and is here for compatibility reasons.
Parameters: - X (Any) – Unneeded.
- y (Any) – Unneeded.
Returns: The estimator instance itself
Return type:
-
set_timeseries_container
(timeseries_container)[source]¶ Set the timeseries, with which the features will be calculated. For a format of the time series container, please refer to extraction. The timeseries must contain the same indices as the later DataFrame, to which the features will be added (the one you will pass to transform()). You can call this function as often as you like, to change the timeseries later (e.g. if you want to extract for different ids).
Parameters: timeseries_container (pandas.DataFrame or dict) – The timeseries as a pandas.DataFrame or a dict. See extraction for the format. Returns: None Return type: None
-
transform
(X)[source]¶ Calculate the features using the timeseries_container and add them to the corresponding rows in the input pandas.DataFrame X.
To save some computing time, you should only include those time series in the container that you need. You can set the timeseries container with the method set_timeseries_container().
Parameters: X (pandas.DataFrame) – the DataFrame to which the calculated timeseries features will be added. This is not the dataframe with the timeseries itself. Returns: The input DataFrame, but with added features. Return type: pandas.DataFrame
-
class
tsfresh.transformers.feature_selector.
FeatureSelector
(settings=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Sklearn-compatible estimator, for reducing the number of features in a dataset to only those that are relevant and significant to a given target. It is basically a wrapper around check_fs_sig_bh().
The check is done by testing the hypothesis
$H_0$ = the Feature is not relevant and can not be added
against
$H_1$ = the Feature is relevant and should be kept
using several statistical tests (depending on whether the feature and/or the target is binary or not). Using the Benjamini Hochberg procedure, only features in $H_0$ are rejected.
You can control how the significance tests are executed by handing in a settings object. Please refer to FeatureSignificanceTestsSettings for more information. If you do not pass a settings object, the defaults are used.
This estimator, like most of the sklearn estimators, works in a two-step procedure. First, it is fitted on training data, where the target is known:
>>> import pandas as pd
>>> X_train, y_train = pd.DataFrame(), pd.Series() # fill in with your features and target
>>> from tsfresh.transformers import FeatureSelector
>>> selector = FeatureSelector()
>>> selector.fit(X_train, y_train)
The estimator keeps track of those features that were relevant in the training step. If you apply the estimator after the training, it will delete all other features in the testing data sample:
>>> X_test = pd.DataFrame()
>>> X_selected = selector.transform(X_test)
After that, X_selected will only contain the features that were relevant during the training.
If you are interested in more information on the features, you can look into the member relevant_features after the fit.
-
fit
(X, y)[source]¶ Extract the information, which of the features are relevant using the given target.
For more information, please see the check_fs_sig_bh() function. All columns in the input data sample are treated as features. The index of all rows in X must be present in y.
Parameters: - X (pandas.DataFrame or numpy.array) – data sample with the features, which will be classified as relevant or not
- y (pandas.Series or numpy.array) – target vector to be used, to classify the features
Returns: the fitted estimator with the information, which features are relevant
Return type:
-
transform
(X)[source]¶ Delete all features, which were not relevant in the fit phase.
Parameters: X (pandas.DataFrame or numpy.array) – data sample with all features, which will be reduced to only those that are relevant Returns: same data sample as X, but with only the relevant features Return type: pandas.DataFrame or numpy.array
-
-
class
tsfresh.transformers.relevant_feature_augmenter.
RelevantFeatureAugmenter
(evaluate_only_added_features=True, feature_selection_settings=None, feature_extraction_settings=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Sklearn-compatible estimator to calculate relevant features out of a time series and add them to a data sample.
Like many other sklearn estimators, this estimator works in two steps:
In the fit phase, all possible time series features are calculated using the time series, that is set by the set_timeseries_container function (if the features are not manually changed by handing in a feature_extraction_settings object). Then, their significance and relevance to the target is computed using statistical methods and only the relevant ones are selected using the Benjamini Hochberg procedure. These features are stored internally.
In the transform step, the information on which features are relevant from the fit step is used and those features are extracted from the time series. These extracted features are then added to the input data sample.
This estimator is a wrapper around most of the functionality in the tsfresh package. For more information on the subtasks, please refer to the single modules and functions, which are:
- Settings for the feature extraction:
FeatureExtractionSettings
- Feature extraction method:
extract_features()
- Extracted features:
feature_calculators
- Settings for the feature selection:
FeatureSignificanceTestsSettings
- Feature selection:
check_fs_sig_bh()
This estimator works analogously to the FeatureAugmenter, with the difference that this estimator only calculates and outputs the relevant features, whereas the other outputs all features.
Also for this estimator, two datasets play a crucial role:
- the time series container with the timeseries data. This container (for the format see extraction) contains the data which is used for calculating the features. It must be groupable by ids which are used to identify which feature should be attached to which row in the second dataframe:
- the input data, where the features will be added to.
Imagine the following situation: You want to classify 10 different financial shares and you have their development in the last year as a time series. You would then start by creating features from the metainformation of the shares, e.g. how long they were on the market etc. and filling up a table - the features of one stock in one row.
>>> import pandas as pd
>>> # Fill in the information of the stocks and the target
>>> X_train, X_test, y_train = pd.DataFrame(), pd.DataFrame(), pd.Series()
You can then extract all the relevant features from the time development of the shares, by using this estimator:
>>> train_time_series, test_time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import RelevantFeatureAugmenter
>>> augmenter = RelevantFeatureAugmenter()
>>> augmenter.set_timeseries_container(train_time_series)
>>> augmenter.fit(X_train, y_train)
>>> augmenter.set_timeseries_container(test_time_series)
>>> X_test_with_features = augmenter.transform(X_test)
X_test_with_features will then contain the same information as X_test (with all the meta information you have probably added) plus some relevant time series features calculated on the time series you handed in.
Please keep in mind that the time series you hand in before fit or transform must contain data for the rows that are present in X.
If you set evaluate_only_added_features to True, your manually-created features that were present in X_train (or X_test) before using this estimator are not touched. Otherwise, those features are also evaluated and may be rejected from the data sample because they are irrelevant.
For a description of what the parameters column_id, column_sort, column_kind and column_value mean, please see extraction.
You can control the feature extraction in the fit step (the feature extraction in the transform step is done automatically) as well as the feature selection in the fit step by handing in settings objects of the type FeatureExtractionSettings and FeatureSignificanceTestsSettings. However, the default settings which are used if you pass no objects are often quite sensible.
-
fit
(X, y)[source]¶ Use the given timeseries from set_timeseries_container() and calculate features from it and add them to the data sample X (which can contain other manually-designed features).
Then determine which of the features of X are relevant for the given target y. Store those relevant features internally to only extract them in the transform step.
If evaluate_only_added_features is True, only reject newly, automatically added features. If it is False, also look at the features that are already present in the DataFrame.
Parameters: - X (pandas.DataFrame or numpy.array) – The data frame without the time series features. The index rows should be present in the timeseries and in the target vector.
- y (pandas.Series or numpy.array) – The target vector to define, which features are relevant.
Returns: the fitted estimator with the information, which features are relevant.
Return type:
-
set_timeseries_container
(timeseries_container)[source]¶ Set the timeseries with which the features will be calculated. For the format of the time series container, please refer to extraction. The timeseries must contain the same indices as the later DataFrame to which the features will be added (the one you will pass to transform() or fit()). You can call this function as often as you like to change the timeseries later (e.g. if you want to extract for different ids).
Parameters: timeseries_container (pandas.DataFrame or dict) – The timeseries as a pandas.DataFrame or a dict. See extraction for the format.
Returns: None
Return type: None
-
transform
(X)[source]¶ After the fit step, it is known which features are relevant. Only extract those from the time series handed in with the function set_timeseries_container().
If evaluate_only_added_features is False, also delete the irrelevant, already present features in the data frame.
Parameters: X (pandas.DataFrame or numpy.array) – the data sample to add the relevant (and delete the irrelevant) features to.
Returns: a data sample with the same information as X, but with added relevant time series features and deleted irrelevant information (only if evaluate_only_added_features is False).
Return type: pandas.DataFrame
- Settings for the feature extraction:
The module transformers
contains several transformers which can be used inside a sklearn pipeline.
tsfresh.utilities package¶
Utility functions for handling the DataFrame conversions to the internal normalized format
(see normalize_input_to_internal_representation
) or on how to handle NaN
and inf
in the DataFrames.
-
tsfresh.utilities.dataframe_functions.
check_for_nans_in_columns
(df, columns=None)[source]¶ Helper function to check for NaN in the data frame and raise a ValueError if there is one.
Parameters: - df (pandas.DataFrame) – the pandas DataFrame to test for NaNs
- columns (list) – a list of columns to test for NaNs. If left empty, all columns of the DataFrame will be tested.
Returns: None
Return type: None
Raise: ValueError if NaNs are found in the DataFrame.
-
tsfresh.utilities.dataframe_functions.
get_range_values_per_column
(df)[source]¶ Retrieves the finite max, min and mean values per column in df and stores them in three dictionaries, each mapping from column name to value. If a column does not contain any finite value, 0 is stored instead.
Parameters: df – Dataframe
Returns: Dictionaries mapping column names to max, min, mean values
-
tsfresh.utilities.dataframe_functions.
impute
(df_impute)[source]¶ Columnwise replaces all NaNs and infs from the DataFrame df_impute with average/extreme values from the same columns. This is done as follows: each occurring inf or NaN in df_impute is replaced by
- -inf -> min
- +inf -> max
- NaN -> median
If the column does not contain finite values at all, it is filled with zeros.
This function modifies df_impute in place. After that, df_impute is guaranteed to not contain any non-finite values. Also, all columns will be guaranteed to be of type np.float64.
Parameters: df_impute (pandas.DataFrame) – DataFrame to impute
Returns: None
Return type: None
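The replacement rule described for impute can be sketched with plain pandas and NumPy. This is a minimal illustration under stated assumptions, not the tsfresh implementation: the helper name impute_sketch and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_sketch(df):
    """Hypothetical helper: replace -inf/+inf/NaN column by column, in place."""
    for col in df.columns:
        values = df[col].astype(np.float64)
        finite = values[np.isfinite(values)]
        if finite.empty:
            # No finite value at all: fill the whole column with zeros.
            df[col] = 0.0
            continue
        values = values.replace(-np.inf, finite.min())  # -inf -> min
        values = values.replace(np.inf, finite.max())   # +inf -> max
        values = values.fillna(finite.median())         # NaN  -> median
        df[col] = values


df = pd.DataFrame({
    "a": [1.0, np.inf, 3.0, np.nan, -np.inf],
    "b": [np.nan, np.inf, np.nan, np.nan, -np.inf],
})
impute_sketch(df)
```

Column a has the finite values 1.0 and 3.0, so its non-finite entries are replaced by the minimum, maximum and median of those two values. Column b has no finite value and is filled with zeros.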
-
tsfresh.utilities.dataframe_functions.
impute_dataframe_range
(df_impute, col_to_max=None, col_to_min=None, col_to_median=None)[source]¶ Columnwise replaces all NaNs and infs from the DataFrame df_impute with average/extreme values from the provided dictionaries. This is done as follows: each occurring inf or NaN in df_impute is replaced by
- -inf -> min
- +inf -> max
- NaN -> median
If a column is not found in one of the dictionaries, the values are calculated from the column's finite values. If the column does not contain finite values at all, it is filled with zeros.
This function modifies df_impute in place. Unless the dictionaries contain NaNs or infs, df_impute is guaranteed to not contain any non-finite values. Also, all columns will be guaranteed to be of type np.float64.
Parameters: - df_impute (pandas.DataFrame) – DataFrame to impute
- col_to_max (dict) – Dictionary mapping column names to max values
- col_to_min – Dictionary mapping column names to min values
- col_to_median – Dictionary mapping column names to median values
-
tsfresh.utilities.dataframe_functions.
impute_dataframe_zero
(df_impute)[source]¶ Replaces all NaNs and infs from the DataFrame df_impute with 0s.
df_impute will be modified in place. All its columns will be of datatype np.float64.
Parameters: df_impute (pandas.DataFrame) – DataFrame to impute
-
tsfresh.utilities.dataframe_functions.
normalize_input_to_internal_representation
(df_or_dict, column_id, column_sort, column_kind, column_value)[source]¶ Try to transform any given input to the internal representation of time series, which is a mapping from string (the kind) to a pandas DataFrame with exactly two columns (the value and the id).
This function can transform pandas DataFrames in different formats or dictionaries to pandas DataFrames in different formats. It is used internally in the extract_features function and should not be called by the user.
Parameters: - df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary. The required shape/form of the object depends on the rest of the passed arguments.
- column_id (basestring or None) – if not None, it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column. If this column name is None, a new column will be added to the pandas DataFrame (or all pandas DataFrames in the dictionary) and the same id for all entries is assumed.
- column_sort (basestring or None) – if not None, sort the rows by this column. Then, the column is dropped. It is not allowed to have NaN values in this column.
- column_kind (basestring or None) – It can only be used when passing a pandas DataFrame (the dictionary is already assumed to be grouped by the kind). It must be present in the DataFrame and no NaN values are allowed. The DataFrame will be grouped by the values in the kind column and each group will be one entry in the resulting mapping. If the kind column is not passed, it is assumed that each column in the pandas DataFrame (except the id or sort column) is a possible kind and the DataFrame is split up into as many DataFrames as there are columns. The exception is when a value column is given: then it is assumed that there is only one column.
- column_value (basestring or None) – If it is given, it must be present and not-NaN on the pandas DataFrames (or all pandas DataFrames in the dictionaries). If it is None, it is assumed that there is only a single remaining column in the DataFrame(s) (otherwise an exception is raised).
Returns: A tuple of 3 elements: the normalized DataFrame as a dictionary mapping from the kind (as a string) to the corresponding DataFrame, the name of the id column and the name of the value column
Return type: (dict, basestring, basestring)
Raise: ValueError
when the passed combination of parameters is wrong or does not fit to the input DataFrame or dict.
-
tsfresh.utilities.dataframe_functions.
restrict_input_to_index
(df_or_dict, column_id, index)[source]¶ Restrict df_or_dict to those ids contained in index.
Parameters: - df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary.
- column_id (basestring) – it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column.
- index (Iterable or pandas.Series) – Index containing the ids
Returns: the restricted df_or_dict
Return type:
Raise: TypeError if df_or_dict is not of type dict or pandas.DataFrame
Contains methods to start and stop the profiler that checks the runtime of the different feature calculators
-
tsfresh.utilities.profiling.
end_profiling
(profiler, filename, sorting=None)[source]¶ Helper function to stop the profiling process and write out the profiled data into the given filename. Before this, sort the stats by the passed sorting.
Parameters: - profiler (cProfile.Profile) – An already started profiler (probably by start_profiling).
- filename (basestring) – The name of the output file to save the profile.
- sorting (basestring) – The sorting of the statistics passed to the sort_stats function.
Returns: None
Return type: None
Start and stop the profiler with:
>>> profiler = start_profiling()
>>> # Do something you want to profile
>>> end_profiling(profiler, "out.txt", "cumulative")
-
tsfresh.utilities.profiling.
start_profiling
()[source]¶ Helper function to start the profiling process and return the profiler (to close it later).
Returns: a started profiler.
Return type: cProfile.Profile
Start and stop the profiler with:
>>> profiler = start_profiling()
>>> # Do something you want to profile
>>> end_profiling(profiler, "out.txt", "cumulative")
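For readers who want to see what such a pair of helpers can look like, here is a minimal sketch built only on the standard library modules cProfile and pstats. The function names with the _sketch suffix are hypothetical stand-ins and are not claimed to match the tsfresh source.

```python
import cProfile
import os
import pstats
import tempfile


def start_profiling_sketch():
    """Hypothetical stand-in: create a profiler, start it and return it."""
    profiler = cProfile.Profile()
    profiler.enable()
    return profiler


def end_profiling_sketch(profiler, filename, sorting=None):
    """Hypothetical stand-in: stop the profiler and write its stats to filename."""
    profiler.disable()
    with open(filename, "w") as out:
        stats = pstats.Stats(profiler, stream=out)
        if sorting:
            # e.g. "cumulative" or "time", as accepted by pstats.Stats.sort_stats
            stats.sort_stats(sorting)
        stats.print_stats()


profiler = start_profiling_sketch()
total = sum(i * i for i in range(10000))  # the code to be profiled
path = os.path.join(tempfile.mkdtemp(), "out.txt")
end_profiling_sketch(profiler, path, "cumulative")
```

Note the argument order: the output file name comes before the sorting key, as in the signature of end_profiling.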
Module contents¶
At the top level we export the three most important functions of tsfresh, which are:
extract_features
select_features
extract_relevant_features
Data Formats¶
tsfresh offers three different options to specify the time series data to be used in the tsfresh.extract_features() function. Irrespective of the input format, tsfresh will always return the calculated features in the same output format.
All three input format options consist of pandas.DataFrame objects. There are four important column types that make up those DataFrames:
Mandatory
- column_id: This column indicates which entities the time series belong to. Features will be extracted individually for each entity. The resulting feature matrix will contain one row per entity.
- column_value: This column contains the actual values of the time series.
Optional (but strongly recommended to specify)
- column_sort: This column contains values that allow the time series to be sorted (e.g. time stamps). It is not required to have equidistant time steps or the same time scale for the different ids and/or kinds. If you omit this column, the DataFrame is assumed to be already sorted in increasing order.
Optional
- column_kind: This column indicates the names of the different time series types (e.g. different sensors in an industrial application). For each kind of time series the features are calculated individually.
Important: None of these columns is allowed to contain any NaN, Inf or -Inf values.
Now there are three slightly different input formats for the time series data:
- A flat DataFrame
- A stacked DataFrame
- A dictionary of flat DataFrames
The difference between a flat and a stacked DataFrame is indicated by specifying or not specifying the parameters column_value and column_kind in the extract_features function.
Input Option 1. Flat DataFrame¶
If both column_value and column_kind are set to None, the time series data is assumed to be in a flat DataFrame. This means that each different time series is saved as its own column.
Example: Imagine you record the values of time series x and y for different objects A and B for three different times t1, t2 and t3. Now you want to calculate some features with tsfresh. Your DataFrame has to look like this:
id | time | x | y |
---|---|---|---|
A | t1 | x(A, t1) | y(A, t1) |
A | t2 | x(A, t2) | y(A, t2) |
A | t3 | x(A, t3) | y(A, t3) |
B | t1 | x(B, t1) | y(B, t1) |
B | t2 | x(B, t2) | y(B, t2) |
B | t3 | x(B, t3) | y(B, t3) |
and you would pass
column_id="id", column_sort="time", column_kind=None, column_value=None
to the extraction functions.
Input Option 2. Stacked DataFrame¶
If both column_value and column_kind are set, the time series data is assumed to be a stacked DataFrame. This means that there are no separate columns for the different types of time series. This representation has several advantages over the flat DataFrame. For example, the time stamps of the different time series do not have to align.
It does not contain different columns for the different types of time series but only one value column and a kind column:
id | time | kind | value |
---|---|---|---|
A | t1 | x | x(A, t1) |
A | t2 | x | x(A, t2) |
A | t3 | x | x(A, t3) |
A | t1 | y | y(A, t1) |
A | t2 | y | y(A, t2) |
A | t3 | y | y(A, t3) |
B | t1 | x | x(B, t1) |
B | t2 | x | x(B, t2) |
B | t3 | x | x(B, t3) |
B | t1 | y | y(B, t1) |
B | t2 | y | y(B, t2) |
B | t3 | y | y(B, t3) |
Then you would set
column_id="id", column_sort="time", column_kind="kind", column_value="value"
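A flat DataFrame can be brought into the stacked layout with pandas alone. The following sketch uses DataFrame.melt on made-up sample values; the numeric time stamps and measurements are assumptions chosen only for illustration.

```python
import pandas as pd

# Flat layout: one column per time series (x and y).
flat = pd.DataFrame({
    "id":   ["A", "A", "B", "B"],
    "time": [1, 2, 1, 2],
    "x":    [0.1, 0.2, 0.3, 0.4],
    "y":    [10.0, 20.0, 30.0, 40.0],
})

# Stacked layout: one "kind" column naming the series and one "value" column.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")
```

The resulting DataFrame has the columns id, time, kind and value, so it matches the column names passed in the stacked example above.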
Input Option 3. Dictionary of flat DataFrames¶
Instead of passing a DataFrame which must be split up by its different kinds, you can also give a dictionary mapping from the kind as string to a DataFrame containing only the time series data of that kind. So essentially you are using a separate DataFrame for each kind of time series.
The data from the example can be split into two DataFrames resulting in the following dictionary
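{ "x":
id | time | value |
---|---|---|
A | t1 | x(A, t1) |
A | t2 | x(A, t2) |
A | t3 | x(A, t3) |
B | t1 | x(B, t1) |
B | t2 | x(B, t2) |
B | t3 | x(B, t3) |
, "y":
id | time | value |
---|---|---|
A | t1 | y(A, t1) |
A | t2 | y(A, t2) |
A | t3 | y(A, t3) |
B | t1 | y(B, t1) |
B | t2 | y(B, t2) |
B | t3 | y(B, t3) |
}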
tsfresh would be passed this dictionary and the following arguments
column_id="id", column_sort="time", column_kind=None, column_value="value"
In this case we do not need to specify the kind column as the kind is the respective dictionary key.
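Such a dictionary can be derived from a stacked DataFrame with a groupby. This is a small sketch on made-up sample values; the variable name kind_to_df is hypothetical.

```python
import pandas as pd

stacked = pd.DataFrame({
    "id":    ["A", "A", "B", "B", "A", "A", "B", "B"],
    "time":  [1, 2, 1, 2, 1, 2, 1, 2],
    "kind":  ["x", "x", "x", "x", "y", "y", "y", "y"],
    "value": [0.1, 0.2, 0.3, 0.4, 10.0, 20.0, 30.0, 40.0],
})

# One DataFrame per kind; the kind column itself becomes the dictionary key.
kind_to_df = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in stacked.groupby("kind")
}
```

Each value of the dictionary now only has the columns id, time and value.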
Output Format¶
The resulting feature matrix for all three input options will be the same.
It will always be a pandas.DataFrame
with the following layout
id | x_feature_1 | ... | x_feature_N | y_feature_1 | ... | y_feature_N |
---|---|---|---|---|---|---|
A | ... | ... | ... | ... | ... | ... |
B | ... | ... | ... | ... | ... | ... |
where the x features are calculated using all x values (independently for A and B), y features using all y values and so on.
scikit-learn Transformers¶
tsfresh includes three scikit-learn compatible transformers. You can easily add them to your existing data science pipeline. If you are not familiar with scikit-learn’s pipeline we recommend you take a look at the official documentation [1].
The purpose of such a pipeline is to assemble several preprocessing steps that can be cross-validated together while setting different parameters. Our tsfresh transformer allows you to extract and filter the time series features during such a preprocessing sequence.
The first two estimators contained in tsfresh are the FeatureAugmenter, which extracts the features, and the FeatureSelector, which only performs the feature selection algorithm.
It is preferable to combine extracting and filtering of the features in a single step to avoid unnecessary feature calculations.
Hence, we have the RelevantFeatureAugmenter, which combines both the extraction and filtering of the features in a single step.
Example¶
In the following example you see how we combine tsfresh’s RelevantFeatureAugmenter and a RandomForestClassifier into a single pipeline.
This pipeline can then fit both our transformer and the classifier in one step.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from tsfresh.examples import load_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter

pipeline = Pipeline([('augmenter', RelevantFeatureAugmenter(column_id='id', column_sort='time')),
                     ('classifier', RandomForestClassifier())])

# The time series container and the target vector
df_ts, y = load_robot_execution_failures()
# An empty feature matrix that only carries the ids as its index
X = pd.DataFrame(index=y.index)

pipeline.set_params(augmenter__timeseries_container=df_ts)
pipeline.fit(X, y)
The parameters of the augment transformer correspond to the parameters of the top-level convenience function extract_relevant_features().
In the example, we only set the names of two columns, column_id='id' and column_sort='time' (see Data Formats for an explanation of those parameters).
Because we cannot pass the time series container directly as a parameter to the augmenter step when calling fit or transform on a sklearn.pipeline.Pipeline, we have to set it manually by calling pipeline.set_params(augmenter__timeseries_container=df_ts).
In general, you can change the time series container from which the features are extracted by calling either the pipeline’s set_params() method or the transformer’s set_timeseries_container() method.
For further examples, see the Jupyter Notebook pipeline_example.ipynb in the notebooks folder of the tsfresh package.
Feature Calculation¶
Overview of extracted features¶
tsfresh already calculates a comprehensive number of features. If you are interested in which features are calculated, just go to our tsfresh.feature_extraction.feature_calculators module. You will find the documentation of every calculated feature there.
Feature naming¶
tsfresh enforces a strict naming of the created features, which you have to follow whenever you create new feature calculators.
This is due to the tsfresh.feature_extraction.FeatureExtractionSettings.from_columns() method, which needs to deduce the following information from the feature name
- the time series that was used to calculate the feature
- the feature calculator method that was used to derive the feature
- all parameters that have been used to calculate the feature (optional)
Hence, to enable tsfresh.feature_extraction.FeatureExtractionSettings.from_columns() to deduce all the necessary information, the features will be named in the following format
{time_series_name}__{feature_name}__{parameter name 1}_{parameter value 1}__[..]__{parameter name k}_{parameter value k}
(Here we assumed that {feature_name} has k parameters).
Examples for feature naming¶
So for example the following feature name
temperature_1__quantile__q_0.6
is the value of the feature tsfresh.feature_extraction.feature_calculators.quantile() for the time series temperature_1 and a parameter value of q=0.6. On the other hand, the feature named
Pressure 5__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5
denotes the value of the feature tsfresh.feature_extraction.feature_calculators.cwt_coefficients() for the time series Pressure 5 under parameter values of widths=(2, 5, 10, 20), coeff=14 and w=5.
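To make the naming scheme concrete, the following sketch splits a feature name back into its three parts. It is an illustration of the format only, not the from_columns() implementation: the function name parse_feature_name is hypothetical, and it assumes that neither the time series name nor a parameter value contains a double underscore.

```python
import ast


def parse_feature_name(feature_name):
    """Hypothetical helper: split a feature name into its components."""
    parts = feature_name.split("__")
    time_series_name, calculator_name = parts[0], parts[1]
    parameters = {}
    for part in parts[2:]:
        # The last single underscore separates parameter name and value.
        name, raw_value = part.rsplit("_", 1)
        try:
            value = ast.literal_eval(raw_value)  # numbers, tuples, ...
        except (ValueError, SyntaxError):
            value = raw_value                    # keep anything else as a string
        parameters[name] = value
    return time_series_name, calculator_name, parameters


simple = parse_feature_name("temperature_1__quantile__q_0.6")
nested = parse_feature_name(
    "Pressure 5__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5"
)
```

For the two example names this yields the time series names temperature_1 and Pressure 5, the calculator names quantile and cwt_coefficients, and the parameter dictionaries described in the text.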
Feature filtering¶
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously.
To limit the number of irrelevant features, tsfresh deploys the fresh algorithm (fresh stands for FeatuRe Extraction based on Scalable Hypothesis tests) [1].
The algorithm is called by tsfresh.feature_selection.feature_selector.check_fs_sig_bh().
It is an efficient, scalable feature extraction algorithm, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features.
The filtering process consists of three phases which are sketched in the following figure:

Phase 1 - Feature extraction¶
Firstly, the algorithm characterizes time series with comprehensive and well-established feature mappings and considers additional features describing meta-information.
The feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
In the figure from above, this corresponds to the change from raw time series to aggregated features.
Phase 2 - Feature significance testing¶
In a second step, each feature vector is individually and independently evaluated with respect to its significance for predicting the target under investigation.
Those tests are contained in the submodule tsfresh.feature_selection.significance_tests.
The result of these tests is a vector of p-values, quantifying the significance of each feature for predicting the label/target.
In the figure from above, this corresponds to the change from aggregated features to p-values.
Phase 3 - Multiple test procedure¶
The vector of p-values is evaluated on the basis of the Benjamini-Yekutieli procedure [2] in order to decide which features to keep.
This multiple testing procedure is contained in the submodule tsfresh.feature_selection.feature_selector.
In the figure from above, this corresponds to the change from p-values to selected features.
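The decision rule of the Benjamini-Yekutieli procedure can be written down in a few lines. The sketch below is a plain Python illustration of the step-up rule from [2], not the code in tsfresh.feature_selection.feature_selector; the function name and the fdr_level parameter are assumptions made for this example.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return one boolean per p-value: True if the feature is kept."""
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor for arbitrary dependency: c(m) = 1 + 1/2 + ... + 1/m
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in increasing order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k / (m * c(m)) * fdr_level
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / (m * c_m) * fdr_level:
            largest_k = rank
    # Keep every feature whose p-value has a rank of at most k
    keep = [False] * m
    for idx in order[:largest_k]:
        keep[idx] = True
    return keep


selected = benjamini_yekutieli([0.001, 0.8, 0.004, 0.5], fdr_level=0.05)
```

With four p-values the thresholds for ranks 1 to 4 are 0.006, 0.012, 0.018 and 0.024, so only the features with p-values 0.001 and 0.004 are kept.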
References¶
[1] Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2016). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-prints: 1610.07717 URL: http://adsabs.harvard.edu/abs/2016arXiv161007717C
[2] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165–1188
How to add a custom feature¶
It may be beneficial to add a custom feature to those that are calculated by tsfresh. To do so, follow these steps:
Step 1. Decide which type of feature you want to implement¶
In tsfresh we differentiate between three types of feature calculation methods
1. aggregate features without parameter
2. aggregate features with parameter
3. apply features with parameters
So if you want to add a single feature without any parameters, stick with type 1., the aggregate feature without parameters.
Then, if your features can be calculated independently for each possible parameter set, stick with type 2., the aggregate features with parameters.
If both cases from above do not apply, because it is beneficial to calculate the features for the different parameter settings at the same time (to e.g. perform auxiliary calculations only once for all features), stick with type 3., the apply features with parameters.
Step 2. Write the feature calculator¶
Depending on which type of feature you are implementing, you can use the following feature calculator skeletons:
1. aggregate features without parameter
@set_property("fctype", "aggregate")
def your_feature_calculator(x):
    """
    The description of your feature

    :param x: the time series to calculate the feature of
    :type x: pandas.Series
    :return: the value of this feature
    :return type: bool or float
    """
    # Calculation of feature as float, int or bool (f is a placeholder for your calculation)
    value = f(x)
    return value
2. aggregate features with parameter
@set_property("fctype", "aggregate_with_parameters")
def your_feature_calculator(x, p1, p2, ...):
    """
    Description of your feature

    :param x: the time series to calculate the feature of
    :type x: pandas.Series
    :param p1: description of your parameter p1
    :type p1: type of your parameter p1
    :param p2: description of your parameter p2
    :type p2: type of your parameter p2
    ...
    :return: the value of this feature
    :return type: bool or float
    """
    # Calculation of feature as float, int or bool (f is a placeholder for your calculation)
    value = f(x)
    return value
3. apply features with parameters
@set_property("fctype", "apply")
def your_feature_calculator(x, c, param):
    """
    Description of your feature

    :param x: the time series to calculate the feature of
    :type x: pandas.Series
    :param c: the time series name
    :type c: str
    :param param: contains dictionaries {"p1": x, "p2": y, ...} with p1 float, p2 int ...
    :type param: list
    :return: the different feature values
    :return type: pandas.Series
    """
    # Calculation of feature as pandas.Series s, the index is the name of the feature
    s = f(x)
    return s
After implementing the feature calculator, please add it to the tsfresh.feature_extraction.feature_calculators submodule. tsfresh will only find feature calculators that are in this submodule.
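As a concrete illustration of the first two skeletons, here are two small calculators. The set_property decorator below is a local stand-in that only attaches an attribute to the function, so that the example runs without tsfresh; inside the package you would use the decorator already available in the feature_calculators submodule. The calculator names value_range and count_above are hypothetical.

```python
def set_property(key, value):
    """Stand-in decorator: attach an attribute to the decorated function."""
    def decorate(func):
        setattr(func, key, value)
        return func
    return decorate


@set_property("fctype", "aggregate")
def value_range(x):
    """Difference between the largest and the smallest value of x."""
    return max(x) - min(x)


@set_property("fctype", "aggregate_with_parameters")
def count_above(x, t):
    """Number of values in x that are strictly larger than the threshold t."""
    return sum(1 for value in x if value > t)


series = [3.0, 1.0, 4.5, 2.0]
range_of_series = value_range(series)
values_above_two = count_above(series, t=2)
```

Both functions work on any iterable of numbers, including a pandas.Series, and return a single value as required for aggregate features.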
Step 3. Add custom settings for your feature¶
Finally, you have to add custom settings if your feature is an apply or aggregate feature with parameters. To do so, just append your parameters to the name_to_param dictionary inside the tsfresh.feature_extraction.settings.FeatureExtractionSettings constructor:
name_to_param.update({
    # here are the existing settings
    ...
    # Now the settings of your feature calculator
    "your_feature_calculator": [{"p1": x, "p2": y, ...} for x, y in ...],
})
That is it, tsfresh will calculate your feature the next time you run it.
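The list of parameter dictionaries from Step 3 is often a grid over a few values per parameter. A minimal sketch of building such a grid with itertools.product follows; the parameter names p1 and p2 and their values are placeholders, and name_to_param is created here as an empty dictionary only so that the example runs on its own.

```python
from itertools import product

name_to_param = {}  # inside tsfresh this dictionary already holds the existing settings

name_to_param.update({
    # One dictionary per combination of p1 and p2
    "your_feature_calculator": [
        {"p1": p1, "p2": p2} for p1, p2 in product([0.1, 0.5], [1, 2, 3])
    ],
})

parameter_sets = name_to_param["your_feature_calculator"]
```

This produces six parameter sets, one for each combination of the two p1 values and the three p2 values.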
FAQ¶
- Does tsfresh support different time series lengths? Yes, it supports different time series lengths. However, some feature calculators may demand a minimal length of the time series. If a shorter time series is passed to such a calculator, normally a NaN is returned.
Authors¶
Development Lead¶
- Maximilian Christ (maximilianchrist.com, max.christ@me.com)
Contributions¶
- Nils Braun (nilslennartbraun@gmail.com)
- Julius Neuffer (julius.neuffer@blue-yonder.com)
License¶
MIT LICENCE
Copyright (c) 2016 Maximilian Christ, Blue Yonder GmbH
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Changelog¶
tsfresh uses Semantic Versioning
Version 0.3.1¶
- fixed several bugs: closing multiprocessing pools / index out of range cwt calculator / division by 0 in index_mass_quantile
- now all warnings are disabled by default
- for a singular type time series data, the name of value column is used as feature prefix
Version 0.3.0¶
- fixed bug with parsing of “NUMBER_OF_CPUS” environment variable
- now features are calculated in parallel for each type
Version 0.2.0¶
- now p-values are calculated in parallel
- fixed bugs for constant features
- allow time series columns to be named 0
- moved uci repository datasets to github mirror
- added feature calculator sampleentropy
- added minimalfeaturesignificance extraction settings
- fixed bug in calculation of fourier coefficients
Version 0.1.2¶
- added support for python 3.5.2
- fixed bug with the naming of the features that made the naming of features non-deterministic
Version 0.1.1¶
- mainly fixes for the read-the-docs documentation, the pypi readme and so on
Version 0.1.0¶
- Initial version :)
How to contribute¶
We want tsfresh to become the biggest archive of feature extraction methods in python. To achieve this goal, we need your help!
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. If you want to add one or two interesting feature calculators, implement a new feature selection process or just fix 1-2 typos, your help is appreciated.
If you want to help, just create a pull request on our github page. To the new user, working with Git can sometimes be confusing and frustrating. If you are not familiar with Git you can also contact us by email.
Guidelines¶
There are three general coding paradigms that we believe in:
- Keep it simple. We believe that “Programs should be written for people to read, and only incidentally for machines to execute.”
- Keep it documented by at least including a docstring for each method and class. Do not describe what you are doing but why you are doing it.
- Keep it tested. We aim for a high test coverage.
There are two important copyright guidelines:
- Please do not include any data sets for which a licence is not available or commercial use is prohibited. Those can undermine the licence of the whole project.
- Do not use code snippets for which a licence is not available (e.g. from stackoverflow) or commercial use is prohibited. Those can undermine the licence of the whole project.
Further, there are some technical decisions we made:
- Clear the Output of iPython notebooks. This improves the readability of related Git diffs.
Testing setup¶
After making your changes, you probably want to test them locally. To run our comprehensive suite of unit tests you have to install all the relevant python packages with
cd /path/to/tsfresh
pip install -r requirements.txt
pip install -r rdocs-requirements.txt
pip install -r test-requirements.txt
pip install -e .
The last command will dynamically link the tsfresh package which means that changes to the code will directly show up for example in your test run.
Then, if you have everything installed, you can run the tests with
python setup.py test
or build the documentation with
python setup.py docs
The finished documentation can be found in the docs/_build/html folder.
On GitHub we use Travis CI, which runs our test suite every time a commit or pull request is sent. The configuration of Travis is controlled by the .travis.yml file.
We are looking forward to hearing from you! =)