Large Input Data

If you are dealing with large time series data, you face multiple problems. The two most important ones are

  • long execution times for feature extraction
  • large memory consumption, possibly beyond what a single machine can handle

To solve only the first problem, you can parallelize the computation as described in Parallelization. Please note that parallelization on your local computer is already turned on by default.
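
For instance, the degree of parallelism can be tuned via the n_jobs parameter of tsfresh.extract_features(). A minimal sketch, where timeseries stands in for your pandas dataframe in one of the supported Data Formats:

from tsfresh import extract_features

# n_jobs controls how many worker processes tsfresh uses for the
# extraction; setting it to 0 turns multiprocessing off entirely.
X = extract_features(timeseries,
                     column_id="id", column_sort="time",
                     n_jobs=4)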

However, for even larger data you need to handle both problems at once. You have multiple possibilities here:

Dask - the simple way

tsfresh accepts a dask dataframe instead of a pandas dataframe as input for the tsfresh.extract_features() function. Dask dataframes allow you to scale your computation beyond your local memory (by partitioning the data internally) and even to large clusters of machines. The dask dataframe API is very similar to the pandas one and can often serve as a drop-in replacement.

All arguments discussed in Data Formats are also valid for the dask case. The input data will be transformed into the correct format for tsfresh using dask methods and the feature extraction will be added as additional computations to the computation graph. You can then add additional computations to the result or trigger the computation as usual with .compute().

Note

The last step of the feature extraction is to bring all features into a tabular format. Especially for very large data samples, this computation can be a severe performance bottleneck. We therefore recommend turning the pivoting off if you do not really need it, and working with the unpivoted data as much as possible.

For example, to read in data from parquet and do the feature extraction:

import dask.dataframe as dd
from tsfresh import extract_features

# Read the input lazily as a dask dataframe
df = dd.read_parquet(...)

# Add the feature extraction to the dask computation graph;
# pivot=False keeps the result in the faster, unpivoted format
X = extract_features(df,
                     column_id="id", column_sort="time",
                     pivot=False)

# Trigger the actual computation
result = X.compute()
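
Since X is still a lazy dask result until you call .compute(), you can also defer the computation and, for example, write the unpivoted features straight to disk instead (the output path here is a placeholder):

# Persist the (unpivoted) features without materializing them in
# local memory first; dask executes the graph during the write.
X.to_parquet("features.parquet")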

Dask - more control

The feature extraction method needs to perform some data transformations before it can call the actual feature calculators. If you want to optimize your data flow, you might want to have more control over how exactly the feature calculation is added to your dask computation graph.

Therefore, it is also possible to add the feature extraction directly:

from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk

# df_grouped: a dask dataframe already grouped by id and kind
features = dask_feature_extraction_on_chunk(df_grouped,
                                            column_id="id",
                                            column_kind="kind",
                                            column_sort="time",
                                            column_value="value")

In this case, however, df_grouped must already be in the correct format: a dask dataframe grouped by the id and kind columns. Check out the documentation of tsfresh.convenience.bindings.dask_feature_extraction_on_chunk() for more information. No pivoting will be performed in this case; a sketch of how to do it yourself follows below.
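
If you do need the tabular format in the end, you can perform the pivoting yourself. A sketch, assuming the unpivoted result has the columns id, variable and value:

# Pivot the long-format result (one row per id/feature pair) into
# the usual one-row-per-id feature table. dask requires the pivot
# column to be categorical first.
features = features.categorize(columns=["variable"])
features = features.reset_index(drop=True) \
                   .pivot_table(index="id", columns="variable",
                                values="value", aggfunc="sum")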

PySpark

Similarly to dask, it is also possible to add the feature extraction to a Spark computation graph. You can find more information in the documentation of tsfresh.convenience.bindings.spark_feature_extraction_on_chunk(). A sketch is shown below.
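
A minimal sketch of how this could look. The SparkSession setup, the parquet source and the column names are assumptions, and, as with dask, the data must already be grouped by id and kind before calling the binding:

from pyspark.sql import SparkSession
from tsfresh.convenience.bindings import spark_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import ComprehensiveFCParameters

spark = SparkSession.builder.getOrCreate()

# Long-format input with columns id, kind, time and value
df = spark.read.parquet(...)

# The binding expects the data grouped by id and kind
df_grouped = df.groupby(["id", "kind"])

features = spark_feature_extraction_on_chunk(df_grouped,
                                             column_id="id",
                                             column_kind="kind",
                                             column_sort="time",
                                             column_value="value",
                                             default_fc_parameters=ComprehensiveFCParameters())

# The result is again unpivoted; pivot it if you need a feature table
features = features.groupby("id").pivot("variable").sum("value")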