Quickstart: Timeseries Feature Extraction

Some datasets are large enough and structurally suited for machine learning methods. Timeseries data, however, cannot be fed into most machine learning algorithms directly.

With bletl.features, you can apply a mix of biologically inspired and statistical methods to extract hundreds of features from timeseries of backscatter, pH and DO.

Under the hood, bletl.features builds on `tsfresh <https://tsfresh.readthedocs.io>`__ and combines it with an extendable API through which you can plug in additional, custom-designed feature extraction methods.
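
For example, a custom extractor could look roughly like the following sketch. The base class name and the get_features signature shown here are assumptions; check the bletl.features API for the actual interface.

[ ]:
# Hypothetical sketch of a custom extractor.
# `features.Extractor` and the `get_features(x, y)` signature are
# assumptions; check the bletl.features API for the real interface.
from bletl import features

class FinalValueExtractor(features.Extractor):
    """Extracts the final observed value of a well's timeseries."""

    def get_features(self, x, y):
        # x: measurement timepoints, y: measured values of one well
        return {
            "final_value": float(y[-1]),
        }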

[1]:
import pandas
import pathlib
from IPython.display import display

import bletl
from bletl import features

Parse the raw data file

[2]:
filepath = pathlib.Path("../tests/data/BL1/NT_1200rpm_30C_DO-GFP75-pH-BS10_12min_20171221_121339.csv")
bldata = bletl.parse(filepath, lot_number=1515, temp=30)
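
Optionally, you can sanity-check the parsed data by pulling one well's timeseries from a filterset, using bletl's get_timeseries accessor:

[ ]:
# Peek at the backscatter timeseries of well A01.
x, y = bldata["BS10"].get_timeseries("A01")
print(len(x), len(y))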

Feature Extraction

You’ll need to provide a list of Extractor objects for each filterset you want to extract from.

Additionally, you can specify last_cycles per well; all measurements after that cycle are ignored, for example because of sacrifice sampling.

[3]:
extractors = {
    "BS10": [features.BSFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "pH": [features.pHFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
    "DO": [features.DOFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
}
last_cycles = {
    "A01": 20,
    "B01": 50,
}

The feature extraction itself takes a while; in this case, roughly 3 minutes for all 48 wells.

[4]:
extracted_features = features.from_bldata(
    bldata=bldata,
    extractors=extractors,
    last_cycles=last_cycles
)
100.00% [48/48 02:31<00:00]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.97it/s]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.84it/s]
Feature Extraction: 100%|██████████| 12/12 [00:03<00:00,  3.98it/s]

Show extracted data

The extracted data is one big DataFrame, indexed by well ID. Each column name starts with the filterset from which the feature was extracted, followed by a double underscore and the name of the feature.

For tsfresh-derived features, you'll have to look up their meaning in the tsfresh documentation.

[5]:
extracted_features.head()
[5]:
BS10__inflection_point_t BS10__inflection_point_y BS10__mue_median BS10__max BS10__mean BS10__median BS10__min BS10__span BS10__stan_dev BS10__time_max ... DO_x__fourier_entropy__bins_2 DO_x__fourier_entropy__bins_3 DO_x__fourier_entropy__bins_5 DO_x__fourier_entropy__bins_10 DO_x__fourier_entropy__bins_100 DO_x__permutation_entropy__dimension_3__tau_1 DO_x__permutation_entropy__dimension_4__tau_1 DO_x__permutation_entropy__dimension_5__tau_1 DO_x__permutation_entropy__dimension_6__tau_1 DO_x__permutation_entropy__dimension_7__tau_1
A01 15.502713 3.93831 1.533980 15.589340 13.236223 12.942862 11.786574 3.802765 1.135330 3.93831 ... 0.304636 0.600166 1.033562 1.468140 2.397895 1.613392 2.196756 2.566599 2.615631 2.639057
A02 114.049172 22.02844 0.512476 114.049172 43.789110 37.938058 10.662952 103.386220 27.608253 22.02844 ... 0.220570 0.257292 0.474328 0.474328 1.781950 1.728323 2.901951 3.792271 4.185139 4.454983
A03 158.104604 22.02891 0.013302 158.104604 52.730195 42.885822 11.259160 146.845445 39.335875 22.02891 ... 0.220570 0.314446 0.566976 0.750980 1.917544 1.574925 2.636769 3.491505 4.107095 4.405701
A04 148.610349 21.13505 0.153967 169.490066 52.947190 40.735298 11.072086 158.417980 41.251284 22.02937 ... 0.220570 0.257292 0.417984 0.603698 1.806336 1.593080 2.736457 3.516927 4.024063 4.280164
A05 129.941607 16.78825 0.367594 181.684883 68.253222 43.571497 11.010531 170.674352 59.208052 22.02988 ... 0.220570 0.220570 0.257292 0.377827 1.348764 1.739515 2.897768 3.783786 4.163782 4.349547

5 rows × 2367 columns
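
Because each column name starts with the filterset, you can slice the feature matrix by filterset with plain string matching:

[ ]:
# Select all backscatter-derived feature columns via the naming convention.
bs_columns = [c for c in extracted_features.columns if c.startswith("BS10__")]
extracted_features[bs_columns].head()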

What’s next?

Many of the extracted features are redundant, and many contain NaN values, so you should consider applying .dropna() before continuing.

Because of the high redundancy between feature columns, you should also consider dimension reduction techniques like PCA to continue working with just a small set of non-redundant features.

Depending on your dataset, advanced high-dimensional visualization techniques such as t-SNE or UMAP are worth exploring.
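
As a starting point, here is a minimal sketch of the NaN cleanup and PCA reduction described above, assuming scikit-learn is installed (it is not among the imports above):

[ ]:
# Minimal sketch: drop feature columns with NaNs, standardize, then
# reduce to a few principal components with PCA.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

clean = extracted_features.dropna(axis=1)       # drop columns containing NaNs
scaled = StandardScaler().fit_transform(clean)  # zero mean, unit variance
pca = PCA(n_components=10)
scores = pca.fit_transform(scaled)              # shape: (n_wells, 10)
print(pca.explained_variance_ratio_.cumsum())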

[6]:
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Fri Jul 02 2021

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.19.0

bletl : 1.0.0
pandas: 1.2.1

Watermark: 2.1.0
