Quickstart: Timeseries Feature Extraction

Some datasets are large enough and structurally suited for machine learning methods. Timeseries data, however, cannot be fed into most machine learning algorithms directly.

With bletl.features, you can apply a mix of biologically inspired and statistical methods to extract hundreds of features from timeseries of backscatter, pH and DO.

Under the hood, bletl.features builds on `tsfresh <https://tsfresh.readthedocs.io>`__ and combines it with an extendable API through which you can plug in additional, custom-designed feature extraction methods.
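
For example, a custom extractor could look roughly like the following sketch. The base class name and the get_features signature shown here are assumptions; check the bletl.features API for the actual interface.

[ ]:
# Hypothetical sketch of a custom extractor.
# `features.Extractor` and the `get_features(x, y)` signature are
# assumptions; check the bletl.features API for the real interface.
from bletl import features

class FinalValueExtractor(features.Extractor):
    """Extracts the final observed value of a well's timeseries."""

    def get_features(self, x, y):
        # x: measurement timepoints, y: measured values of one well
        return {
            "final_value": float(y[-1]),
        }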

[1]:
import pandas
import pathlib
from IPython.display import display

import bletl
from bletl import features

Parse the raw data file

[2]:
filepath = pathlib.Path("../tests/data/BL1/NT_1200rpm_30C_DO-GFP75-pH-BS10_12min_20171221_121339.csv")
bldata = bletl.parse(filepath, lot_number=1515, temp=30)
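
Optionally, you can sanity-check the parsed data by pulling one well's timeseries from a filterset, using bletl's get_timeseries accessor:

[ ]:
# Peek at the backscatter timeseries of well A01.
x, y = bldata["BS10"].get_timeseries("A01")
print(len(x), len(y))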

Feature Extraction

You’ll need to provide a list of Extractor objects for each filterset you want to extract from.

Additionally, you can specify last_cycles per well; all measurements after that cycle are ignored, for example because of sacrifice sampling.

[3]:
extractors = {
    "BS10": [features.BSFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "pH": [features.pHFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "DO": [features.DOFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
}
last_cycles = {
    "A01": 20,
    "B01": 50,
}

The feature extraction itself takes a while; in this case, roughly 3 minutes for all 48 wells.

[4]:
extracted_features = features.from_bldata(
    bldata=bldata,
    extractors=extractors,
    last_cycles=last_cycles
)
100.00% [48/48 02:31<00:00]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.97it/s]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.84it/s]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.98it/s]

Show extracted data

The extracted data is one big DataFrame, indexed by well ID. Each column name starts with the filterset from which the feature was extracted, followed by a double underscore and the name of the feature.

For tsfresh-derived features, you'll have to look up their meaning in the tsfresh documentation.

[5]:
extracted_features.head()
[5]:
BS10__inflection_point_t BS10__inflection_point_y BS10__mue_median BS10__max BS10__mean BS10__median BS10__min BS10__span BS10__stan_dev BS10__time_max ... DO_x__fourier_entropy__bins_2 DO_x__fourier_entropy__bins_3 DO_x__fourier_entropy__bins_5 DO_x__fourier_entropy__bins_10 DO_x__fourier_entropy__bins_100 DO_x__permutation_entropy__dimension_3__tau_1 DO_x__permutation_entropy__dimension_4__tau_1 DO_x__permutation_entropy__dimension_5__tau_1 DO_x__permutation_entropy__dimension_6__tau_1 DO_x__permutation_entropy__dimension_7__tau_1
A01 15.502713 3.93831 1.533980 15.589340 13.236223 12.942862 11.786574 3.802765 1.135330 3.93831 ... 0.304636 0.600166 1.033562 1.468140 2.397895 1.613392 2.196756 2.566599 2.615631 2.639057
A02 114.049172 22.02844 0.512476 114.049172 43.789110 37.938058 10.662952 103.386220 27.608253 22.02844 ... 0.220570 0.257292 0.474328 0.474328 1.781950 1.728323 2.901951 3.792271 4.185139 4.454983
A03 158.104604 22.02891 0.013302 158.104604 52.730195 42.885822 11.259160 146.845445 39.335875 22.02891 ... 0.220570 0.314446 0.566976 0.750980 1.917544 1.574925 2.636769 3.491505 4.107095 4.405701
A04 148.610349 21.13505 0.153967 169.490066 52.947190 40.735298 11.072086 158.417980 41.251284 22.02937 ... 0.220570 0.257292 0.417984 0.603698 1.806336 1.593080 2.736457 3.516927 4.024063 4.280164
A05 129.941607 16.78825 0.367594 181.684883 68.253222 43.571497 11.010531 170.674352 59.208052 22.02988 ... 0.220570 0.220570 0.257292 0.377827 1.348764 1.739515 2.897768 3.783786 4.163782 4.349547

5 rows × 2367 columns
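
Because each column name starts with the filterset, you can slice the feature matrix by filterset with plain string matching:

[ ]:
# Select all backscatter-derived feature columns via the naming convention.
bs_columns = [c for c in extracted_features.columns if c.startswith("BS10__")]
extracted_features[bs_columns].head()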

What’s next?

Many of the extracted features are redundant, and many contain NaN values, so you should consider applying .dropna() before continuing.

Because of the high redundancy between feature columns, you should also consider dimension reduction techniques like PCA to continue working with just a small set of non-redundant features.

Depending on your dataset, advanced high-dimensional visualization techniques such as t-SNE or UMAP are worth exploring.
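
As a starting point, here is a minimal sketch of the NaN cleanup and PCA reduction described above, assuming scikit-learn is installed (it is not among the imports above):

[ ]:
# Minimal sketch: drop feature columns with NaNs, standardize, then
# reduce to a few principal components with PCA.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

clean = extracted_features.dropna(axis=1)       # drop columns containing NaNs
scaled = StandardScaler().fit_transform(clean)  # zero mean, unit variance
pca = PCA(n_components=10)
scores = pca.fit_transform(scaled)              # shape: (n_wells, 10)
print(pca.explained_variance_ratio_.cumsum())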

[6]:
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Fri Jul 02 2021

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.19.0

bletl : 1.0.0
pandas: 1.2.1

Watermark: 2.1.0
