Introduction

Why tsfresh?

tsfresh is used for systematic feature engineering from time series and other sequential data [1]. These data have in common that they are ordered by an independent variable. The most common independent variable is time (time series). Other examples of sequential data are reflectance and absorption spectra, which have wavelength as their ordering dimension. To keep things simple, we refer to all types of sequential data as time series.

[Figure: the time series]

(and yes, it is pretty cold!)

Now you want to calculate different characteristics such as the maximum or minimum temperature, the average temperature or the number of temporary temperature peaks:

[Figure: some characteristics of the time series]

Without tsfresh, you would have to calculate all those characteristics manually; tsfresh automates this process by calculating and returning all of those features for you.
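To make the manual route concrete, here is a minimal sketch of calculating the characteristics named above by hand, using only the Python standard library. The temperature readings are made up for illustration, and the peak definition (a point strictly greater than both direct neighbours) is an assumption of this sketch, not necessarily the definition tsfresh uses.

```python
from statistics import mean

# Hypothetical temperature readings in degrees Celsius (illustrative data only).
temperatures = [14.0, 15.5, 13.0, 16.0, 15.0, 17.5, 16.5]


def count_peaks(values):
    """Count points strictly greater than both direct neighbours.

    This peak definition is an assumption made for this sketch.
    """
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Every characteristic needs its own hand-written line or helper function.
characteristics = {
    "maximum": max(temperatures),
    "minimum": min(temperatures),
    "mean": mean(temperatures),
    "number_of_peaks": count_peaks(temperatures),
}

print(characteristics)
```

Each additional characteristic means another hand-written function, which is the repetitive work that automated feature extraction removes.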

In addition, tsfresh is compatible with the Python libraries pandas and scikit-learn, so you can easily integrate the feature extraction with your current routines.

What can we do with these features?

The extracted features can be used to describe the time series; they often give new insights into the time series and their dynamics. They can also be used to cluster time series and to train machine learning models that perform classification or regression tasks on time series.

The tsfresh package has been successfully used in the following projects:

  • prediction of steel billets quality during a continuous casting process [2],

  • activity recognition from synchronized sensors [3],

  • volcanic eruption forecasting [4],

  • authorship attribution from written text samples [5],

  • characterisation of extrasolar planetary systems from time-series with missing data [6],

  • sensor anomaly detection [7],

  • and many, many more.

What can’t we do with tsfresh?

Currently, tsfresh is not suitable:

  • for streaming data (by streaming data we mean data that is usually used for online operations, while time series data is usually used for offline operations)

  • to train models on the extracted features (we do not want to reinvent the wheel; to train machine learning models, check out the Python package scikit-learn)

  • for usage with highly irregular time series. tsfresh uses timestamps only to order observations. Many features are interval-agnostic (e.g., number of peaks) and can be determined for any series, but some other features (e.g., linear trend) assume equal spacing in time and should be used with care when this assumption is not met.

However, some of these use cases could be implemented. If you have an application in mind, open an issue at https://github.com/blue-yonder/tsfresh/issues, or feel free to contact us.
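The caveat about unevenly spaced series can be illustrated with a small standard-library sketch. It is not tsfresh's implementation; it only compares a least-squares slope fitted against the true timestamps with one fitted against the observation position, which is what an equal-spacing assumption amounts to. The data and the helper name slope are made up for illustration.

```python
def slope(x, y):
    """Ordinary least-squares slope of y against x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    return numerator / denominator


# Hypothetical, unevenly spaced series in which value = 2 * timestamp.
timestamps = [0, 1, 2, 10, 20]
values = [0.0, 2.0, 4.0, 20.0, 40.0]

# Trend against the true timestamps recovers the underlying rate of change.
slope_by_time = slope(timestamps, values)

# Trend against the position 0, 1, 2, ... treats all gaps as equal.
positions = list(range(len(values)))
slope_by_position = slope(positions, values)

print(slope_by_time, slope_by_position)
```

The two slopes differ substantially for this series. When observations are spaced irregularly, a trend feature that relies only on their order describes change per observation, not change per unit of time.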

What else is out there?

There is a MATLAB package called hctsa which can be used to automatically extract features from time series. It is also possible to use hctsa from within Python through the pyopy package. Other available packages are featuretools, FATS and cesium.

References