tsfresh.convenience package

Submodules

tsfresh.convenience.bindings module

tsfresh.convenience.bindings.dask_feature_extraction_on_chunk(df, column_id, column_kind, column_value, column_sort=None, default_fc_parameters=None, kind_to_fc_parameters=None)[source]

Extract features from a grouped dask dataframe, given the column names and the extraction settings. This wrapper function should only be used if you have a dask dataframe as input. All input and output format handling needs to be done before or after calling it.

Examples

For example, if you want to extract features from the robot example dataframe (stored as a CSV file):

Import statements:

>>> from dask import dataframe as dd
>>> from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
>>> from tsfresh.feature_extraction.settings import MinimalFCParameters

Read in the data

>>> df = dd.read_csv("robot.csv")

Bring the data into the correct format. It needs to be a grouped dataframe (grouped by the time series id and the feature kind), where each group chunk consists of a dataframe with exactly four columns: column_id, column_kind, column_sort and column_value. You can find a description of the columns in Data Formats. Please note: for this function to work, all four columns need to be present! If necessary, create the columns and fill them with dummy values.

>>> df = df.melt(id_vars=["id", "time"],
...              value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
...              var_name="kind", value_name="value")
>>> df_grouped = df.groupby(["id", "kind"])

Call the feature extraction

>>> features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
...                                             column_sort="time", column_value="value",
...                                             default_fc_parameters=MinimalFCParameters())

Write out the data in a tabular format

>>> features = features.categorize(columns=["variable"])
>>> features = features.reset_index(drop=True) \
...                .pivot_table(index="id", columns="variable", values="value", aggfunc="mean")
>>> features.to_csv("output")
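
Note that dask evaluates lazily, so the computation only runs once to_csv() is called. If you would rather have the pivoted result in memory as a pandas dataframe, one option is to materialize it explicitly (features_df is just an illustrative name):

>>> features_df = features.compute()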
Parameters:
  • df (dask.dataframe.groupby.DataFrameGroupBy) – A dask dataframe grouped by id and kind.
  • default_fc_parameters (dict) – mapping from feature calculator names to parameters. Only those names which are keys in this dict will be calculated. See the ComprehensiveFCParameters class for more information, and the sketch below for the expected format.
  • kind_to_fc_parameters (dict) – mapping from kind names to objects of the same type as the ones for default_fc_parameters. If you put a kind as a key here, the fc_parameters object (which is the value) will be used instead of the default_fc_parameters. This means that kinds for which kind_to_fc_parameters does not have an entry will be ignored by the feature extraction.
  • column_id (str) – The name of the id column to group by.
  • column_sort (str or None) – The name of the sort column.
  • column_kind (str) – The name of the column indicating the kind of the value.
  • column_value (str) – The name of the column holding the value itself.
Returns:

A dask dataframe with the columns column_id, “variable” and “value”. The index is taken from the grouped dataframe.

Return type:

dask.dataframe.DataFrame (id int64, variable object, value float64)
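
Both settings dictionaries use the same format as elsewhere in tsfresh: keys are feature calculator names, values are None (for calculators without parameters) or a list of parameter dictionaries. A minimal sketch of the expected format (the calculators and kinds here are only examples):

>>> from tsfresh.feature_extraction.settings import MinimalFCParameters
>>> default_fc_parameters = {"mean": None, "maximum": None,
...                          "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}]}
>>> kind_to_fc_parameters = {"F_x": {"mean": None},
...                          "F_y": MinimalFCParameters()}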

tsfresh.convenience.bindings.spark_feature_extraction_on_chunk(df, column_id, column_kind, column_value, column_sort=None, default_fc_parameters=None, kind_to_fc_parameters=None)[source]

Extract features from a grouped spark dataframe, given the column names and the extraction settings. This wrapper function should only be used if you have a spark dataframe as input. All input and output format handling needs to be done before or after calling it.

Examples

For example, if you want to extract features from the robot example dataframe (stored as a CSV file):

Import statements:

>>> from tsfresh.convenience.bindings import spark_feature_extraction_on_chunk
>>> from tsfresh.feature_extraction.settings import MinimalFCParameters

Read in the data

>>> df = spark.read.csv(...)

Bring the data into the correct format. It needs to be a grouped dataframe (grouped by the time series id and the feature kind), where each group chunk consists of a dataframe with exactly four columns: column_id, column_kind, column_sort and column_value. You can find a description of the columns in Data Formats. Please note: for this function to work, all four columns need to be present! If necessary, create the columns and fill them with dummy values.

>>> df = ...
>>> df_grouped = df.groupby(["id", "kind"])
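
The df = ... placeholder above is left open on purpose, as the reshaping depends on your input schema. One hypothetical way to fill it, assuming the same wide robot layout as in the dask example (columns id, time, F_x, ..., T_z), is the Spark SQL stack function to melt the value columns:

>>> from pyspark.sql import functions as F
>>> value_vars = ["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"]
>>> stack_expr = "stack({}, {}) as (kind, value)".format(
...     len(value_vars), ", ".join("'{0}', {0}".format(c) for c in value_vars))
>>> df = df.select("id", "time", F.expr(stack_expr))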

Call the feature extraction

>>> features = spark_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
...                                              column_sort="time", column_value="value",
...                                              default_fc_parameters=MinimalFCParameters())

Write out the data in a tabular format

>>> features = features.groupby("id").pivot("variable").sum("value")
>>> features.write.csv("output")
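
Spark also evaluates lazily; the computation is triggered by the csv write. If you prefer the pivoted result in memory as a pandas dataframe, one option is:

>>> features_df = features.toPandas()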
Parameters:
  • df (pyspark.sql.group.GroupedData) – A spark dataframe grouped by id and kind.
  • default_fc_parameters (dict) – mapping from feature calculator names to parameters. Only those names which are keys in this dict will be calculated. See the ComprehensiveFCParameters class for more information.
  • kind_to_fc_parameters (dict) – mapping from kind names to objects of the same type as the ones for default_fc_parameters. If you put a kind as a key here, the fc_parameters object (which is the value) will be used instead of the default_fc_parameters. This means that kinds for which kind_to_fc_parameters does not have an entry will be ignored by the feature extraction.
  • column_id (str) – The name of the id column to group by.
  • column_sort (str or None) – The name of the sort column.
  • column_kind (str) – The name of the column indicating the kind of the value.
  • column_value (str) – The name of the column holding the value itself.
Returns:

A spark dataframe with the columns column_id, “variable” and “value”.

Return type:

pyspark.sql.DataFrame[id: bigint, variable: string, value: double]

tsfresh.convenience.relevant_extraction module

tsfresh.convenience.relevant_extraction.extract_relevant_features(timeseries_container, y, X=None, default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, show_warnings=False, disable_progressbar=False, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative', test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False, n_jobs=1, distributor=None, chunksize=None, ml_task='auto')[source]

High-level convenience function that extracts time series features from timeseries_container and then returns the feature matrix X, possibly augmented with those features that are relevant with respect to the target vector y.

For more details see the documentation of extract_features() and select_features().

Examples

>>> from tsfresh.examples import load_robot_execution_failures
>>> from tsfresh import extract_relevant_features
>>> df, y = load_robot_execution_failures()
>>> X = extract_relevant_features(df, y, column_id='id', column_sort='time')
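
If you already have a dataframe of hand-crafted features, you can pass it as X and it will be returned augmented with the relevant time series features. A minimal sketch (other_feature is a made-up column name):

>>> import pandas as pd
>>> X_manual = pd.DataFrame({"other_feature": 0.0}, index=y.index)
>>> X = extract_relevant_features(df, y, X=X_manual,
...                               column_id='id', column_sort='time')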
Parameters:
  • timeseries_container – The pandas.DataFrame with the time series to compute the features for, or a dictionary of pandas.DataFrames. See extract_features().
  • X (pandas.DataFrame) – A DataFrame containing additional features
  • y (pandas.Series) – The target vector
  • default_fc_parameters (dict) – mapping from feature calculator names to parameters. Only those names which are keys in this dict will be calculated. See the ComprehensiveFCParameters class for more information.
  • kind_to_fc_parameters (dict) – mapping from kind names to objects of the same type as the ones for default_fc_parameters. If you put a kind as a key here, the fc_parameters object (which is the value) will be used instead of the default_fc_parameters.
  • column_id (str) – The name of the id column to group by. Please see Data Formats.
  • column_sort (str) – The name of the sort column. Please see Data Formats.
  • column_kind (str) – The name of the column indicating the kind of the value. Please see Data Formats.
  • column_value (str) – The name of the column holding the value itself. Please see Data Formats.
  • chunksize (None or int) – The size of one chunk that is submitted to the worker process for the parallelisation, where one chunk is defined as the time series for one id and one kind. If you set the chunksize to 10, one task is to calculate all features for 10 time series. If it is set to None, depending on the distributor, heuristics are used to find the optimal chunksize. If you get out of memory exceptions, you can try the dask distributor together with a smaller chunksize (a combined sketch of some of these options follows at the end of this section).
  • n_jobs (int) – The number of processes to use for parallelization. If zero, no parallelization is used.
  • distributor (class) – Advanced parameter: set this to a class name that you want to use as a distributor. See utilities/distribution.py for more information. Leave it set to None if you want tsfresh to choose the best distributor.
  • show_warnings (bool) – Show warnings during the feature extraction (needed for debugging of calculators).
  • disable_progressbar (bool) – Do not show a progressbar while doing the calculation.
  • profile (bool) – Turn on profiling during feature extraction.
  • profiling_sorting (str) – How to sort the profiling results (see the documentation of the profiling package for more information).
  • profiling_filename (str) – Where to save the profiling results.
  • test_for_binary_target_binary_feature (str) – Which test to be used for binary target, binary feature (currently unused)
  • test_for_binary_target_real_feature (str) – Which test to be used for binary target, real feature
  • test_for_real_target_binary_feature (str) – Which test to be used for real target, binary feature (currently unused)
  • test_for_real_target_real_feature (str) – Which test to be used for real target, real feature (currently unused)
  • fdr_level (float) – The FDR level that should be respected, this is the theoretical expected percentage of irrelevant features among all created features.
  • hypotheses_independent (bool) – Can the significance of the features be assumed to be independent? Normally, this should be set to False as the features are never independent (e.g. mean and median)
  • ml_task (str) – The intended machine learning task. Either ‘classification’, ‘regression’ or ‘auto’. Defaults to ‘auto’, meaning the intended task is inferred from y. If y has a boolean, integer or object dtype, the task is assumed to be classification, else regression.
Returns:

Feature matrix X, possibly extended with relevant time series features.
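
As a sketch of how a few of the options above combine (the values are chosen purely for illustration, not as recommendations):

>>> X = extract_relevant_features(df, y, column_id='id', column_sort='time',
...                               n_jobs=4, chunksize=10,
...                               fdr_level=0.01, ml_task='classification')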

Module contents

The convenience submodule contains methods that allow the user to extract and filter features conveniently.