Quickstart: Timeseries Feature Extraction
Some datasets are large enough and structurally suitable for applying machine learning methods. Timeseries data, however, cannot be fed into most machine learning algorithms directly.
With bletl.features, you can apply a mix of biologically inspired and statistical methods to extract hundreds of features from timeseries of backscatter, pH and DO.
Under the hood, bletl.features uses `tsfresh <https://tsfresh.readthedocs.io>`__ and combines it with an extendable API that you can use to provide additional, custom-designed feature extraction methods.
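The sketch below illustrates what such a custom extractor could look like. This is only a rough, hypothetical example: the Extractor base class is mentioned later in this guide, but the method name and signature used here (get_features(x, y) returning a dict of named features) are assumptions for illustration, so check the bletl.features API documentation for the actual interface.

import numpy

from bletl import features


class MaxSlopeExtractor(features.Extractor):
    """Hypothetical custom extractor; the real base-class interface may differ."""

    def get_features(self, x, y):
        # Assumed inputs: x = time vector, y = measurement vector of one well and filterset.
        slopes = numpy.diff(y) / numpy.diff(x)
        # Assumed output: a dict mapping feature names to scalar values.
        return {
            "max_slope": float(numpy.max(slopes)),
        }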
[1]:
import pandas
import pathlib
from IPython.display import display
import bletl
from bletl import features
Parse the raw data file
[2]:
filepath = pathlib.Path(r"..\tests\data\BL1\NT_1200rpm_30C_DO-GFP75-pH-BS10_12min_20171221_121339.csv")
bldata = bletl.parse(filepath, lot_number=1515, temp=30)
Feature Extraction
You’ll need to provide a list of Extractor objects for each filterset you want to extract from. Additionally, you can specify the last_cycle after which the timeseries will be ignored, for example because of sacrifice sampling.
[3]:
extractors = {
"BS10" : [features.BSFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
"pH" : [features.pHFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
"DO" : [features.DOFeatureExtractor(), features.StatisticalFeatureExtractor(), features.TSFreshExtractor()],
}
last_cycles = {
"A01" : 20,
"B01" : 50
}
The feature extraction itself takes a while: in this case, roughly 3 minutes for all 48 wells.
[4]:
extracted_features = features.from_bldata(
bldata=bldata,
extractors=extractors,
last_cycles=last_cycles
)
Feature Extraction: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:03<00:00, 3.97it/s]
Feature Extraction: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:03<00:00, 3.84it/s]
Feature Extraction: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:03<00:00, 3.98it/s]
Show extracted data
The extracted data is a big DataFrame, indexed by well ID. Each column name starts with the name of the filterset from which the data was taken, followed by a double underscore and the name of the extracted feature.
For tsfresh-derived features, you’ll have to look up the meaning of the features in the tsfresh documentation.
[5]:
extracted_features.head()
[5]:
 | BS10__inflection_point_t | BS10__inflection_point_y | BS10__mue_median | BS10__max | BS10__mean | BS10__median | BS10__min | BS10__span | BS10__stan_dev | BS10__time_max | ... | DO_x__fourier_entropy__bins_2 | DO_x__fourier_entropy__bins_3 | DO_x__fourier_entropy__bins_5 | DO_x__fourier_entropy__bins_10 | DO_x__fourier_entropy__bins_100 | DO_x__permutation_entropy__dimension_3__tau_1 | DO_x__permutation_entropy__dimension_4__tau_1 | DO_x__permutation_entropy__dimension_5__tau_1 | DO_x__permutation_entropy__dimension_6__tau_1 | DO_x__permutation_entropy__dimension_7__tau_1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A01 | 15.502713 | 3.93831 | 1.533980 | 15.589340 | 13.236223 | 12.942862 | 11.786574 | 3.802765 | 1.135330 | 3.93831 | ... | 0.304636 | 0.600166 | 1.033562 | 1.468140 | 2.397895 | 1.613392 | 2.196756 | 2.566599 | 2.615631 | 2.639057 |
A02 | 114.049172 | 22.02844 | 0.512476 | 114.049172 | 43.789110 | 37.938058 | 10.662952 | 103.386220 | 27.608253 | 22.02844 | ... | 0.220570 | 0.257292 | 0.474328 | 0.474328 | 1.781950 | 1.728323 | 2.901951 | 3.792271 | 4.185139 | 4.454983 |
A03 | 158.104604 | 22.02891 | 0.013302 | 158.104604 | 52.730195 | 42.885822 | 11.259160 | 146.845445 | 39.335875 | 22.02891 | ... | 0.220570 | 0.314446 | 0.566976 | 0.750980 | 1.917544 | 1.574925 | 2.636769 | 3.491505 | 4.107095 | 4.405701 |
A04 | 148.610349 | 21.13505 | 0.153967 | 169.490066 | 52.947190 | 40.735298 | 11.072086 | 158.417980 | 41.251284 | 22.02937 | ... | 0.220570 | 0.257292 | 0.417984 | 0.603698 | 1.806336 | 1.593080 | 2.736457 | 3.516927 | 4.024063 | 4.280164 |
A05 | 129.941607 | 16.78825 | 0.367594 | 181.684883 | 68.253222 | 43.571497 | 11.010531 | 170.674352 | 59.208052 | 22.02988 | ... | 0.220570 | 0.220570 | 0.257292 | 0.377827 | 1.348764 | 1.739515 | 2.897768 | 3.783786 | 4.163782 | 4.349547 |
5 rows × 2367 columns
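Because of the <filterset>__<feature> naming convention, it is easy to slice out the features of a single filterset with plain pandas. For example (not part of the original notebook output):

# select only the backscatter-derived feature columns
bs_features = extracted_features.filter(like="BS10__", axis="columns")
bs_features.head()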
What’s next?
Many of the features are redundant. At the same time, many features have NaN values, so you should consider applying .dropna() before continuing.
Because many feature columns are highly redundant, you should also consider applying dimensionality reduction techniques such as PCA to continue working with just a small set of non-redundant features.
Depending on your dataset, advanced high-dimensional visualization techniques such as t-SNE or UMAP are worth exploring.
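A minimal sketch of such a follow-up, assuming scikit-learn is available (the choice of 10 principal components is arbitrary and only for illustration):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# drop feature columns that contain NaN values or are constant across wells
X = extracted_features.dropna(axis="columns")
X = X.loc[:, X.std() > 0]

# standardize the features, then project onto the first 10 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
X_reduced.shape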
[6]:
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Fri Jul 02 2021
Python implementation: CPython
Python version : 3.7.9
IPython version : 7.19.0
bletl : 1.0.0
pandas: 1.2.1
Watermark: 2.1.0