The Physics Behind Data

As a physicist turned data scientist, my colleagues often ask me how my physics background is relevant.
Sure, in the mundane day-to-day work, there isn't much in common beyond some overlap in mathematics and statistical techniques. However, philosophically, there is a profound connection:
Every data point corresponds to a real physical event that happened in the real world.
So in a sense, data can always be traced to physical phenomena.
This perspective, unfortunately, is rarely discussed in standard statistics or machine learning discourse. As a result, I believe a foundational understanding is missing among many of my fellow data practitioners. In this article, I'll describe in more detail how this physics-centric understanding can help us gain more insight into our data.
Everything is Time-Series

Where do data come from? They are derived from events in our Universe. Every event happens at a specific location and time; this fact follows from the principle of locality. This means that every list of data, no matter the format, is a (space)time-series.
This fact is often overlooked in data science, where instead of time series, one may work with a collection of images, texts, or simple tabular files. But within each sentence, each word is uttered or typed sequentially, with brain signals firing one after another. Each document was produced at a specific location and time. The same is true for image or survey data: there is an original sequence of pixels firing or questions being answered at a specific location.
The spacetime nature of data is often ignored, but that doesn't mean it is unimportant. For example, scientific knowledge may change over time, so even the correctness of statements could change.
In fact, very subtle effects have been documented in particle physics experiments, where phenomena such as moon phases (cern.ch) and nearby train schedules (G. Brun et al.) have measurably altered the data. It makes one wonder: how much of our other data is affected by these nuanced side effects?
Practically, data science is much less precise than particle physics, so one often assumes that the details of when and where are small sources of noise that can be ignored. But what makes it okay to ignore them? The answer comes from physics once again.
Everything is Scrambled

Ever tried to unscramble an egg? Or separate out a bucket of mixed paint into its original colors? Without some molecular gymnastics, it's downright impossible.
The tendency for nature to scramble things up is actually a blessing in disguise. Why? Because it turns out the scrambling happens rather indiscriminately. This means we can view the results of the scrambling in a uniform way, ignoring the details of the scrambling process itself. The mathematical details are formalized in _Ergodic Theory_.
This physical fact allows us to use statistics to approximate our world. More precisely, when we measure our time series for a long time (with many samples), the results can be treated as if they came from some statistical distribution. Roughly speaking, we can write (this is known as the Ergodic Theorem):
Time Average ≈ Statistical Average
However, one must keep in mind that this approximation isn't always valid, even though many data science applications implicitly assume it. Below are some required ingredients:
- The amount of stuff/possibilities in a system needs to be preserved. Technically, this is the "measure-preserving" property of a system's evolution, which generally applies in real systems.
- The time needs to be long enough. In other words, our sample size needs to be large enough.
- "Long enough" means the system reaches a sort of (thermal) equilibrium, so that the scrambling is efficient. Once equilibrium is reached, the system forgets many unmeasurable details, like subtle differences in initial conditions.
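To make the ergodic approximation concrete, here is a minimal sketch (the AR(1) process, parameters, and burn-in choice are illustrative assumptions, not from the article): for a well-scrambling process, a long time average lands near the known statistical average.

```python
import random

def ar1_series(phi=0.5, sigma=1.0, n=200_000, seed=42):
    """Simulate a stationary AR(1) process: x_t = phi * x_{t-1} + noise."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series

series = ar1_series()
burn_in = 1_000  # discard early samples, before the process equilibrates
time_avg = sum(series[burn_in:]) / len(series[burn_in:])

# The stationary (statistical) mean of this process is exactly 0,
# so the long time average should land very close to it.
print(time_avg)
```

The process here is measure-preserving, mixes quickly, and is sampled for a long time, so all three ingredients above are satisfied.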
While it is true that many data science tools, like machine learning (ML) models, don't require all of these assumptions to be useful (their utility is often measured in real time), it is still important to recognize their fundamental, physics-based limits. For instance:
- An ML model may perform far worse in production than in testing because the underlying time series never reached equilibrium; in data science we call this distribution shift. This is a common occurrence in stock market modeling.
- A statistical model may perform very poorly even though the sample size doesn't seem small by conventional standards. This happens when the observation time isn't long enough for the system to satisfy the equilibrium condition, as can occur in social science or with extreme events.
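A toy example of this failure mode, under illustrative assumptions: a random walk never equilibrates, so its time average depends wildly on the particular run, no matter how many samples we collect.

```python
import random

def random_walk_time_avg(seed, n=50_000):
    """Time average of a simple random walk, a process with no equilibrium."""
    rng = random.Random(seed)
    position, total = 0.0, 0.0
    for _ in range(n):
        position += rng.gauss(0.0, 1.0)
        total += position
    return total / n

# Five runs of the "same" process: because the walk never reaches
# equilibrium, the time averages disagree wildly instead of converging
# to a single statistical average.
averages = [random_walk_time_avg(seed) for seed in range(5)]
print(averages)
```

No amount of extra samples fixes this: the spread between runs actually grows with the observation window.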
So far, physics has helped us understand the limits of utilizing basic statistics to understand our world. But what about statistical reasoning and making causal inferences? Once again, physics imposes a limit that cannot be ignored.
Everything is Correlated

What caused the rain today? What caused the Covid-19 pandemic? What caused climate change? These causal questions lie at the heart of the utility of statistics. However, physics doesn't provide a clear answer to them. This is due to the _Butterfly Effect_, which accompanies systems that scramble very well (chaotic systems).
Why is the butterfly effect such a big deal? It tells us that even minute effects, like the flap of a butterfly's wing, can accumulate into enormous changes to a system. In fact, this divergence is often exponential in time:
system change ~ exp[ time × scrambling rate ]
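This exponential divergence is easy to demonstrate with the logistic map, a standard textbook chaotic system (used here purely for illustration):

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x), chaotic at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two initial conditions differing by one part in a billion...
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-9)

# ...become completely decorrelated within a few dozen steps,
# because the tiny gap grows roughly exponentially at each iteration.
gaps = [abs(x - y) for x, y in zip(a, b)]
print(gaps[0], max(gaps))
```

The scrambling rate in the formula above corresponds to the system's Lyapunov exponent; for this map it is large enough that a billionth-scale perturbation reaches order one in about thirty steps.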
So how do we answer questions like "What caused the rain?" Do we need to include the flap of every butterfly? What about people flapping their hands? Or even just random dust molecules moving about? If a tiny change in any of the minute details of the world could alter the weather, it would seem like the entire Universe caused it to rain! But this notion just isn't very useful.
Indeed, beyond this trivial fact that everything causes everything else, it isn't possible to define a strict sense of causality for each event in our Universe, because everything is correlated.
In order to make sense of causality, one needs to think of a hypothetical ensemble: imagine the weather across 100 days with similar conditions instead of focusing on one specific day. Then we separate variables into ones we can control versus ones we cannot control or don't care about. If we do this properly, the butterfly effect averages out, and we are left with what we call causality.
What is the implication for data science? It means that we can never control for all the variables, and that statistical inferences will always have an irremovable bias based on what we consider as control or independent variables.
This is actually a feature of making inferences because humans are proactive beings and we want to change the world. We don't want to dwell on things like how flapping butterflies could ultimately affect our climate – we can't control detailed weather patterns. But what we can control, in the context of climate, are things like global emissions and carbon output. That's why it makes sense to focus on these other more controllable aspects.
The existence of these uncontrolled variables also hampers our efforts to predict outliers and extreme events. This is one reason why we were taken by surprise by black swan events such as flash crashes in stock markets, pandemics, and natural disasters.
Physics exposes a fundamental limitation: isolating data or variables as truly independent is impossible. Our very selection of variables introduces bias, and data scientists must be aware of these inherent limitations.
Conclusion

Hopefully, I've convinced you that understanding our physical world can make us better data scientists. From understanding data itself, to making models and drawing inferences, the laws of nature absolutely need to be taken into account.
I believe that, all too often, we try to separate different scientific disciplines. But fundamentally, they're all studies of the one Universe we inhabit. So it just makes sense that data science should both draw inspiration from and improve our understanding of other natural sciences.
If you enjoyed this article, here are some similar ones you might like:
Understanding Large Language Models: The Physics of (Chat)GPT and BERT
A Physicist's View of Machine Learning: The Thermodynamics of Machine Learning