Time Series Complexity Analysis Using Entropy
Every data scientist knows this: the first step in solving a Machine Learning problem is exploring the data.
And it's not only about understanding which features can help you solve the problem. That requires domain knowledge, a lot of effort, and a lot of asking around and trying things out. It is a necessary step, but in my opinion it is step number two.
The first step is, in some way, shape, or form, based on the analysis of how complex your data is. Are they asking you to find fine details and patterns in something that is kind of always the same, or are the outputs completely different from each other? Do they want you to find the distance between 0.0001 and 0.0002, or the distance between 0 and 10?
Let me explain myself better.
For example, I am a signal processing guy. I have studied the Fourier Transform, the Chirplet Transform, the Wavelet Transform, the Hilbert Transform, Time Series Forecasting, Time Series Clustering, 1D CNNs, RNNs, and a lot of other scary names.
A very common problem in the Time Series domain is going from an input (which might itself be another time series) to a time series output. For example:
- You have a property of an experimental setup and you want to simulate your experiment using Machine Learning: this is actually my PhD thesis, and it is called surrogate modelling.
- You have the values of the stock market up to day 300 and you want to predict day 301: this is very well known and it is called time series forecasting (see the sketch right after this list).
- You have a signal that is very dirty or noisy and you want to clean it up: this is called encoder-decoder signal denoising, and it is also very well known.
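To make the input-to-output framing concrete, here is a minimal sketch of the forecasting case from the list above. Everything in it is made up for illustration: the series is a synthetic random walk, and `make_windows` is a hypothetical helper, not a library function.

```python
import numpy as np

# Toy stand-in for a real signal: a random walk of 1000 "days"
rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=1000))

def make_windows(y, input_len=300, horizon=1):
    """Slice a 1-D series into (input window, next values) training pairs."""
    X, t = [], []
    for start in range(len(y) - input_len - horizon + 1):
        X.append(y[start : start + input_len])          # values up to "day 300"
        t.append(y[start + input_len : start + input_len + horizon])  # "day 301"
    return np.asarray(X), np.asarray(t)

X, y_next = make_windows(series)
print(X.shape, y_next.shape)  # (700, 300) (700, 1)
```

The same "many input windows, one target series" shape shows up, with different semantics, in the surrogate modelling and denoising cases as well.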
And in these problems, the first thing that I look at, surprisingly, is the output (not the input) time series.
Let's say that I take a random time series from my dataset. Is it a gentle and smooth combination of sines and cosines? Is it a polynomial function? A logarithmic one? A function I can't even name?
And if I take another random time series, how does it change? Is the task based on looking at small changes from an obvious baseline, or is it to identify completely different behaviors all across the dataset?
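As a concrete (and entirely synthetic) illustration of this first look, the sketch below draws a few random output series from a hypothetical `outputs` array and then crudely compares two of them against the overall spread of the data. All names and numbers here are assumptions made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical target data: 50 output series of 200 timesteps each,
# here just a noisy sine so the snippet runs on its own.
rng = np.random.default_rng(0)
outputs = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.1 * rng.normal(size=(50, 200))

# Eyeball a few random output series: do they all look alike, or not?
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
for ax, idx in zip(axes, rng.choice(len(outputs), size=3, replace=False)):
    ax.plot(outputs[idx])
    ax.set_title(f"Random output series #{idx}")
plt.tight_layout()
plt.show()

# A crude numeric version of the same question: how far apart are two
# random series compared to the overall scale of the data?
a = outputs[rng.integers(len(outputs))]
b = outputs[rng.integers(len(outputs))]
print("mean |a - b|:", np.abs(a - b).mean(), "| overall std:", outputs.std())
```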
In a single word, we are trying to understand how complex our task is: we are estimating the complexity of our time series. Now, the word "complex" can mean something different to each of us.
When my wife shows me her anatomy lessons, I find them extremely complex, but for her it's just another Tuesday.