Philosophy and data science – Thinking deeply about data

Author: Murphy  |  2025-03-22
Image by Cottonbro Studios from Pexels.com

My hope is that by the end of this article you will have a good understanding of how philosophical thinking around causation applies to your work as a data scientist. Ideally you will have a deeper philosophical perspective to give context to your work!

This is the third part in a multi-part series about philosophy and data science. Part 1 covers how the theory of determinism connects with data science and part 2 is about how the philosophical field of epistemology can help you think critically as a data scientist.

Introduction

I love how many philosophical topics take a seemingly obvious concept, like causality, and make you realize it is not as simple as you think. For example, without looking up a definition, try to define causality off the top of your head. That is a difficult task (for me, at least)! This exercise hopefully nudged you to realize that causality isn't as black and white as you may have thought.

Here is what this article will cover:

  1. Challenges of observing causality
  2. Deterministic vs probabilistic causality
  3. Regularity theory of causality
  4. Process theory of causality
  5. Counterfactual theory of causality
  6. Bringing it all together

Causality's Unobservability

David Hume, a famous skeptic and one of my favorite philosophers, made the astute observation that we cannot observe causality directly with our senses. Here's a classic example: we can see a baseball flying towards the window and we can see the window break, but we cannot see the causality itself. We cannot see that the window must break. This is the primary challenge of causality: we have to infer it from our observations.

"All events seem entirely loose and separate. One event follows another; but we never can observe any tie between them."

David Hume, An Enquiry Concerning Human Understanding

I think one way to understand that causality is not directly observable is through a thought experiment. Imagine a parallel universe with different physical laws. Every time there is a causal interaction, a purple light shines. In this universe, science is very easy. If I hit one pool ball with another, a purple light would flash indicating that the first pool ball caused the second to roll across the table. In this universe, we can observe causality directly!

Now, back in our universe, when one pool ball hits another, we cannot physically observe the causation; instead, we have to look at the events that happen and make inductive interpretations. Perhaps we repeat the collision multiple times and observe that, every time, the second ball rolls across the table. Then we infer from these repeated, indirect observations that one ball causes the other to move.

Because we can't directly observe causality, there are many theories and definitions of how we can identify it through indirect means.

Deterministic vs. probabilistic causality

As I mentioned above, the first article in this series was about determinism in data science. I tangentially covered causality and determinism in the portion about design of experiments. Theories of determinism (or the lack thereof) strongly influence deep thinking about causality.

Deterministic causality states that causal relationships have no elements of randomness in them. If A causes B, then A will cause B in the exact same way every single time, given that other conditions are held constant. I think this is best demonstrated with an example. Say we are running an experiment where we drop a rubber ball from a height of three feet and record the height of the first bounce. Now, suppose we can control every single factor in the experiment. If deterministic causality (or the theory of determinism more broadly) is true, we could hypothetically remove all variance from the height of the bounce. In other words, the ball would bounce to the exact same height every single time, without any difference between trials.

On the other hand, probabilistic causality proposes that there is some randomness in the causal relationship. If A causes B, we will still see some variance in how A causes B, even if everything else is held constant. In our rubber ball example, we would not observe the exact same bounce height from each drop, even if we held everything else equal.
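The contrast between the two views can be sketched in a few lines of simulation. This is a toy model, not physics: the restitution coefficient and the noise level are made-up numbers chosen purely to illustrate zero-variance versus nonzero-variance causation.

```python
import random

random.seed(42)  # reproducible "randomness"

def bounce_height(drop_height, restitution=0.8, noise_sd=0.0):
    """First-bounce height of a ball dropped from drop_height.

    With noise_sd == 0 the causal relationship is deterministic: the same
    drop always yields the same bounce. A positive noise_sd models
    probabilistic causation: the same cause, a spread of effects.
    """
    height = drop_height * restitution
    if noise_sd:
        height += random.gauss(0, noise_sd)
    return height

# Deterministic causality: zero variance across trials.
deterministic = [bounce_height(3.0) for _ in range(5)]

# Probabilistic causality: irreducible randomness, all else held equal.
probabilistic = [bounce_height(3.0, noise_sd=0.05) for _ in range(5)]

print(deterministic)   # every trial identical
print(probabilistic)   # clustered around the same height, never exactly equal
```

Controlling more factors in a real experiment corresponds to shrinking `noise_sd`; the philosophical question is whether it can ever truly reach zero.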

At a more superficial level, these two theories can be reconciled by acknowledging our epistemic limits (I talked about this a little too much in the first article!). Epistemic limits are the limits on what we can learn or observe about the world around us. Both views can lead to the same conclusion: the randomness we perceive may exist simply because we can't account for, or control, all relevant factors in our analyses or experiments. Even in the simple rubber ball drop, it is easy to think of multiple things that would be difficult to control completely: temperature, atmospheric pressure, release technique, height measurement error, etc. One of the most difficult things to control for would be the ball itself! If you use the same ball multiple times, the impact of prior drops may change the ball's bounce; if you use a new ball each time, imperfections in the manufacturing process mean that no two balls are exactly the same. Either way, randomness is introduced!

Pragmatically, these two theories give us the same results. Since we clearly have significant epistemic limits, our approaches to understanding causation will be the same either way. In data science, we generally think of causation along probabilistic lines.

Note that as we control for more factors and the experiment gets cleaner and cleaner, the variance of the response goes down. The question dividing deterministic and probabilistic causality is whether that variance can reach zero for every causal relationship.

Regularity theory of causality

One of the simpler philosophical definitions of causality comes from the regularity theory. This theory says that causality can be established by observing that one event regularly follows another.

Regularity theory defines causation by the regular sequencing of events. Philosophers who subscribe to regularity theory don't have to differentiate between correlation and causation; they define causation by correlation. There doesn't have to be an intrinsic connection. If we observe that every time I let go of an object it falls, I can conclude (by induction) that letting go of an object causes it to fall.

While redefining causality as simple correlation certainly makes identifying ‘causation' easier, it really doesn't offer us much practical knowledge! If we redefine correlation as causation and then decide to act on that knowledge, we may be gravely disappointed when the intervention is unsuccessful.

Imagine that we have a pool in our backyard. In the pool we have a pool toy and a bunch of leaves. We observe that the leaves and the toy tend to be in the same part of the pool. Using regularity theory, we could say that the toy causes the leaves to move with it. From the regularity theory standpoint, there isn't anything wrong with that. But let's say that now we want to get all of the leaves close to the edge of the pool so we can easily remove them. We decide that, since the toy ‘causes' the leaves to be close to it, we will move the toy to the edge of the pool. Will the leaves follow? Of course not, because the definition of causation proposed by the regularity theory doesn't necessarily extend to this kind of intervention.

By most definitions of causation, we would say that the leaves and the pool toy have a correlative relationship, not a causal one. The wind or the pool jets are the factors that actually cause the leaves to move around the pool. The toy is impacted by those same forces, which is why the leaves and the toy tend to end up in the same areas. The important point is that regularity theory's definition of causation is not strong enough to differentiate between correlation and causation.

A toy in a pool tends to be near the leaves; according to regularity theory, that is sufficient to say that the toy causes the leaves to be near it (image by author)
This definition of causality doesn't support intervention: if we move the toy, the leaves will not move with it! (image by author)
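The pool example can be simulated directly: a hidden common cause (the wind) drives both the toy and the leaves, producing exactly the kind of regularity that regularity theory would label causation. All the numbers below are arbitrary choices for illustration.

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation, written from scratch to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# The wind is the hidden common cause pushing both the toy and the leaves.
wind = [random.uniform(-1, 1) for _ in range(1000)]
toy = [w + random.gauss(0, 0.1) for w in wind]
leaves = [w + random.gauss(0, 0.1) for w in wind]

# Regularity theory sees this strong association and calls it causation.
print(corr(toy, leaves))  # strongly positive

# Intervention: drag the toy to the pool's edge.
toy = [5.0] * len(toy)
# The leaves' data-generating process never referenced the toy, so the
# intervention changes nothing about where the leaves sit.
```

The lesson is in the data-generating code itself: `leaves` is a function of `wind`, not of `toy`, so no amount of moving the toy will herd the leaves.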

As data scientists, we add the most value by recommending actions; if we adopted the regularity theory's definition of causation, our actions may or may not be useful. We need a stronger definition of causation, one that supports using causation to make changes that get desirable results!

Process theory of causality

Process theory seeks to understand the reason behind causation; it looks to explain the relationships between events. In the example from the last section, we would be hard-pressed to find a process or mechanism by which the pool toy moves the leaves. The explanation that the wind and jets move the leaves provides a much better understanding of the process actually at work.

Another example of explaining a causal process: ‘heat causes butter to melt because, when heat energy is transferred to the butter's molecules, they move more, thus making the butter melt.' Here, we identify causation by explaining the specific causal process.

This approach can work great and helps solve the correlation vs. causation problem created by the regularity theory. Under process theory, we require some kind of explanation; we are no longer satisfied with merely observing that two events tend to be connected. If we can't come up with a satisfactory explanation, we look deeper for a causal relationship that we can explain.

On the surface this looks pretty fool-proof, but I have one major problem with process theory as a stand-alone philosophical approach to causality: it is easy to come up with a process that seems correct but is actually wrong! We seem to have an innate desire to draw causal conclusions about the world¹, and that tendency often leads us to plausible-sounding but incorrect causal explanations.

Nassim Taleb tells a funny story about this exact problem in his book ‘The Black Swan.' On the day Saddam Hussein was captured, the bond market was up and the stock market was down, and a news article's title read:

‘U.S. Treasuries Rise; Hussein Capture May Not Curb Terrorism'

Here's a breakdown of the title in causal terms:

image by author

Later in the day, the market trends reversed and the title of the article was edited to:

‘U.S. Treasuries Fall; Hussein Capture Boosts Allure of Risky Assets'

image by author

The writer of the article was able to fabricate plausible causal processes for both higher and lower Treasury prices from the exact same news! Clearly, both explanations cannot be correct at the same time.

I'm not proposing that the Process Theory approach to causality is fatally flawed by this problem. I simply want to point out that just because you can come up with a seemingly coherent explanation to a relationship does not mean that you are correct! The extra thought required to come up with a causal process or mechanism can be very helpful indeed, but we must be wary that not everything that makes sense is right!

Counterfactual theory of causality

The counterfactual approach establishes causal relationships between events by asking, ‘What would've happened had things been different?' This question prompts us to simulate an alternative world where things transpired differently; this alternative world is the counterfactual. By assessing the difference between the state of the actual world (the factual) and our simulated world (the counterfactual), we can establish causal relationships.

Once again, I think it is easiest to understand via example:

Let's say that we threw a baseball at a window and the window broke. We can use the counterfactual approach to ask, "If we had not thrown the ball at the window, would it still have broken?" By applying logic to the question's answer, we can infer the causal relationship between the two events.

Below is a table showing the "factual" and the "counterfactual" of the baseball/window relationship:

image by author

From the counterfactual, we can conclude that if the ball had not been thrown, then the window would not have been broken. Therefore, the ball being thrown caused the window to break.

Here is a silly example to show how the counterfactual approach would deny a causal relationship between events:

image by author

Since it would've rained whether or not I called a friend, we can say that the events are independent. The rain was not caused by me calling a friend.

As you are reading this, you may have a major complaint with this approach; I do! The challenge is: how do we know the counterfactual is correct? It is, by definition, not observable; it is made up! Even when the answer seems obvious (the baseball/window example seems very straightforward, but we can't know for sure that the window wasn't about to spontaneously break at the exact moment the ball hit it!), we still can't be sure that our counterfactual is what actually would've happened. The only way to know a counterfactual with 100% confidence would be to do event A, observe the results, then go back in time, not do event A, and observe the results. That, of course, is impossible! While we don't have access to time machines in data science, we do have a few techniques that let us do more than just speculate about the counterfactual. The two main tools I have used are (1) testing and (2) modeling.

With testing, we try to simulate counterfactuals through replication. In the baseball example, we would have multiple windows, some windows we would throw balls at and others we would not. We try to control for as many differences as possible between the experiments (i.e., the same brand of window and the same speed of baseball thrown). We use random assignment and statistical techniques to mitigate the impact of other factors that we can't control. We observe the results and then make conclusions based on what we observe. Testing is really just simulating counterfactuals!
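A randomized test like the one described above can be simulated end to end. The break probabilities below are invented "ground truth" that the experiment, by design, never sees; random assignment lets the difference in group outcomes recover the causal effect.

```python
import random

random.seed(1)

# Hidden ground truth the experiment hopes to recover (assumed numbers).
def break_probability(ball_thrown):
    return 0.90 if ball_thrown else 0.02

# Random assignment: each of 1,000 windows gets a ball thrown at it or not,
# decided by a coin flip, so uncontrolled factors balance out on average.
results = []
for _ in range(1000):
    treated = random.random() < 0.5
    broke = random.random() < break_probability(treated)
    results.append((treated, broke))

treated_rate = (sum(b for t, b in results if t)
                / sum(1 for t, b in results if t))
control_rate = (sum(b for t, b in results if not t)
                / sum(1 for t, b in results if not t))

# The treated group stands in for the factual world, the control group for
# the counterfactual one; their difference estimates the causal effect.
print(round(treated_rate - control_rate, 2))  # close to the true 0.88
```

The control group is the experiment's stand-in for the counterfactual: windows that are identical on average, except that no ball was thrown.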

While testing is the ideal way of simulating counterfactuals, it is time consuming and expensive. I have used specific modeling techniques to quickly and cheaply simulate counterfactuals. We can create a model with the target variable as the caused event and a predictor as the causal event in question. For example, a customer's propensity to buy a product is the target variable and the percent discount is a predictor that we suspect has a causal relationship. We can then adjust the discount percent variable in the model to create counterfactual predictions of what the customer would do under various discount levels. This is just an example of a way to use modeling to create counterfactuals. This approach has a lot of assumptions that are beyond the scope of this article. There are also multiple other ways to use modeling to create counterfactuals that I will not cover here.
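The discount example can be sketched as follows. Everything here is illustrative: the historical data is simulated from an assumed relationship, and the model is a deliberately simple least-squares line rather than any particular propensity-modeling technique.

```python
import random

random.seed(2)

# Simulated purchase history (assumed data-generating process): propensity
# rises roughly 2 points per percentage point of discount, plus noise.
discounts = [random.choice([0, 5, 10, 15, 20]) for _ in range(500)]
propensity = [0.10 + 0.02 * d + random.gauss(0, 0.03) for d in discounts]

# Fit a least-squares line: propensity ≈ intercept + slope * discount.
n = len(discounts)
mx = sum(discounts) / n
my = sum(propensity) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(discounts, propensity))
         / sum((x - mx) ** 2 for x in discounts))
intercept = my - slope * mx

# Counterfactual predictions: the model's guess at what a customer would
# do under discount levels they never actually received.
for d in (0, 10, 25):
    print(d, round(intercept + slope * d, 3))
```

Note that the 25% prediction extrapolates beyond any discount in the data, which is one of the many assumptions this approach quietly leans on.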

Thinking in counterfactuals can be a powerful way to understand causal relationships. We do, however, have to feel confident that the counterfactuals we ideate or create through data science techniques actually reflect what would've happened under different circumstances.

Bringing it all together

The philosophy of causality gives data scientists a lot of useful perspectives on how causality can be understood and used to add data-driven value.

We typically think of causality from a probabilistic perspective: if event A impacts the probability of event B, then event A has a causal relationship with event B. Whether the true nature of causality is probabilistic, or we just perceive it that way, depends on whether or not the universe is deterministic.

Regularity theory helps us identify relationships that are likely part of a causal ecosystem. The related events may or may not cause each other (in the way we typically think of causation), but if we want to understand causality, they are an important part of the puzzle.

Process theory requires us to come up with an explanation of why one thing causes another. If the explanation is supported by the data and our domain knowledge, it can be an important way to establish causation that is actually useful. It can also help us avoid making incorrect causal conclusions based simply on correlation.

Counterfactual thinking takes it a step further, helping us run thought experiments that get to the heart of useful causality: we think about what would or would not happen under different circumstances. This lends itself well to making recommendations, because we can reason about what would or would not happen if we executed an intervention. Testing/experimentation and specific modeling techniques can help us build more data-driven counterfactuals, which in turn can inform recommended actions.

All of the philosophical approaches to causality we discussed help us overcome the fact that we can't observe causation directly. Each theory gives us a useful perspective. A balanced approach in using the various schools of thought can lead to better data driven recommendations!

  1. Causal assumptions seem to be an innate psychological ability for us. See chapter 6 of Daniel Kahneman's ‘Thinking, Fast and Slow.'

Tags: Causality Data Analysis Data Science Philosophy Thoughts And Theory
