Lessons from COVID-19: Why Probability Distributions Matter

Author:Murphy  |  View: 24652  |  Time: 2025-03-22 19:08:52
Photo by Glen Carrie on Unsplash


If you've been following my articles, you've probably noticed my recent emphasis on probability distributions. I've spent a lot of time talking about their importance, and for good reason. If you've already grasped why these distributions are crucial, this article will serve as a nice reinforcement. If not, I hope this article will provide some new insights for you!

Let me ask you a question

Why are probability distributions so important? Why do we spend so much time studying probability density functions (PDFs) and cumulative distribution functions (CDFs)? Hint: The answer depends on whom you ask.

However, you probably came here for a more direct answer. So… for me… I'll answer that question by helping you understand extreme values like Xₘᵢₙ and Xₘₐₓ (If you just want the answer, please skip to the end of the article)!

Hopefully my explanations are intuitive and accessible, without unnecessary jargon.

Also… This will likely be my final article focusing solely on probability distributions (though I might cover specific ones like the t-distribution, z-distribution, or chi-squared distribution in the future). From here, I plan to explore more complex concepts!


Table of Contents

  1. Quick Reflection of Terminologies
  2. Why are probability distributions so important?
  3. Distribution of Xₘₐₓ
  4. Answers for Xₘᵢₙ
  5. Summary

Please skip this section if you want to get to the main part!

Before we begin… A Quick Reflection

It's possible that you might get some notations confused going forward, so I'll clarify this quickly.

Terminologies

Distribution of "X": When we say 'Distribution of X', we refer to the way the probabilities are distributed across the possible values of X. For continuous variables, this is represented by the Probability Density Function (PDF), and for discrete variables, it is represented by the Probability Mass Function (PMF).

Upper case "X": Upper case 'X' represents a quantity or measurement that can vary in a random process. It doesn't have a fixed value but instead follows a probability distribution.

Lower case "x": Lower case 'x' is a fixed number or threshold you're interested in when analyzing X. It's the value you use to calculate probabilities or define events.

Probability Density Functions (PDFs)

The PDF gives us the relative likelihood of different values occurring. It applies to continuous variables, so while the height of the curve tells us about likelihood, the actual probability lies in the area under the curve.

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx, where f(x) is the Probability Density Function

Example: Let's say Medium is analyzing the time an average reader spends on an article. Since the Probability Density Function applies to continuous variables, we can't get a non-zero probability for an exact value.

I know that sounds confusing, so let me explain it in a different way. What would be the probability of a reader reading an article for exactly 10 minutes? Think about it..! What are the realistic chances that the read time is precisely 10 minutes, to the second and beyond?

It's zero! Why?

Even if the reader stops reading at 10 minutes and 0.000001 seconds, it's technically not the exact time you guessed! For continuous variables, the probability of an exact value is always zero.

This is why we need to determine the probability over an interval!

Key takeaway: The PDF helps us answer questions like, "What's the probability of a reader spending between 5 and 10 minutes on an article?"
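
To make the "area under the curve" idea concrete, here is a minimal sketch. The Gamma model for read time (shape 4, scale 2) is purely my assumption for illustration, not something taken from Medium data. It integrates the PDF between 5 and 10 minutes and then shows how the area around exactly 10 minutes shrinks as the interval narrows.

```python
from scipy.integrate import quad
from scipy.stats import gamma

# Hypothetical model (an assumption, not real Medium data):
# read time in minutes ~ Gamma(shape=4, scale=2), which has a mean of 8 minutes
read_time = gamma(a=4, scale=2)

# P(5 <= T <= 10) is the area under the PDF between 5 and 10
prob_5_to_10, _ = quad(read_time.pdf, 5, 10)
print(f"P(5 <= T <= 10) = {prob_5_to_10:.4f}")

# The area over an ever-narrower interval around 10 minutes shrinks toward zero,
# which is why the probability of "exactly 10 minutes" is zero
for half_width in (1.0, 0.1, 0.001):
    area, _ = quad(read_time.pdf, 10 - half_width, 10 + half_width)
    print(f"P(|T - 10| <= {half_width}) = {area:.6f}")
```

Any other continuous distribution would show the same behavior; only the numbers would change.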

Cumulative Distribution Functions (CDFs)

The CDF takes it a step further by telling us the probability of a value falling below a specific threshold.

F(x) = P(X ≤ x) = ∫ f(t) dt, integrated over t from −∞ to x, where f(t) is the Probability Density Function

Example: Using the same scenario, the CDF answers, "What percentage of readers spend less than 10 minutes on an article?"

Key takeaway: The CDF builds on the PDF, providing cumulative probabilities for better decision-making.
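
Here is a small sketch of the same idea in code. Again, the Gamma model for read time is only an assumption I picked for illustration. It reads the share of readers under 10 minutes straight from the CDF, gets an interval probability as a difference of two CDF values, and inverts the CDF to find the median.

```python
from scipy.stats import gamma

# Hypothetical model (an assumption, not real Medium data):
# read time in minutes ~ Gamma(shape=4, scale=2)
read_time = gamma(a=4, scale=2)

# CDF: share of readers who spend 10 minutes or less on an article
share_under_10 = read_time.cdf(10)
print(f"P(T <= 10) = {share_under_10:.4f}")

# An interval probability is just a difference of two CDF values
prob_5_to_10 = read_time.cdf(10) - read_time.cdf(5)
print(f"P(5 <= T <= 10) = {prob_5_to_10:.4f}")

# Inverting the CDF (ppf) gives the read time that half of the readers stay under
median_read_time = read_time.ppf(0.5)
print(f"Median read time = {median_read_time:.2f} minutes")
```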


So…. WHY do we really care about distributions?

Probability distributions are at the core of modeling uncertainty in real-world data. As I've often heard from colleagues, professors, and mentors in the field: data scientists rarely deal with certainty. And that's exactly what we (the data scientists) are there for! To help make sense of some of that uncertainty and create data-driven analysis that helps management make actionable decisions.

Honestly, the above paragraph seems a little too cliché for me (even though I wrote it). So, I'll be more clear-cut. Probability distributions offer a foundational framework for understanding and analyzing data by:

  • Describing the likelihood of different outcomes
  • Enabling data scientists to make informed predictions and decisions based on patterns observed in datasets

Now, Can You Answer These Questions?

  • What is the distribution of Xₘₐₓ?
  • What is the distribution of Xₘᵢₙ?

Note: It's important to be able to answer these questions as it means that you have really mastered the foundation of these probability density and cumulative distribution functions. (Plus, these questions frequently come up as introductory topics in data science interviews.)


Distribution of Xₘₐₓ

Before we go into determining the distribution of Xₘₐₓ, let's first ask ourselves… Why would we ever want to calculate the distribution of Xₘₐₓ in practice?

In practice, the maximum value (Xₘₐₓ) often represents the most extreme condition or scenario that a system might encounter. Knowing the distribution of Xₘₐₓ​ allows us to assess the probability and severity of these events so we can make critical decisions before they happen.

In short, the distribution of Xₘₐₓ​ provides valuable insights, such as:

  • The most likely range of extreme values.
  • The probability of exceeding a critical threshold.
  • The expected maximum value in repeated scenarios.

Unfortunately for all of us, the following example will give us a great idea of what I mean.


Photo by Annie Spratt on Unsplash

An example that we all felt recently: COVID-19

During the COVID-19 pandemic, no one could initially predict how devastating its impact would be. Governments and healthcare systems worldwide scrambled to understand and respond to the crisis. Hospitals faced a critical question:

"Can we estimate the maximum number of daily hospital admissions during this outbreak to allocate resources effectively?"

This wasn't just a theoretical exercise. Accurately predicting the peak number of hospital admissions was crucial for managing limited resources like beds, staff, and ventilators. Planning for worst-case scenarios – extreme spikes in hospitalizations – could mean the difference between life and death.

1. Modeling Daily Admissions as a Distribution of Xₘₐₓ

Let's assume that daily hospital admissions during the COVID-19 outbreak followed some pattern, represented by a random variable X. For simplicity, let's model this as a normal distribution with:

  • Mean (μ): 100 patients/day (average daily admissions)
  • Standard deviation (σ): 20 patients/day (variability in admissions)

Suppose the University of Washington Medical Center observed hospital admissions over a 30-day period (n = 30). The goal was to determine the distribution of the maximum daily admissions (Xₘₐₓ​) during this time frame. Remember that when we say the distribution, we are referring to the probability density function (PDF).

In order to find the PDF, we have to revisit our old cumulative distribution function (CDF) equation first.

2. Connecting the CDF to​ Xₘₐₓ

Fₓ(x) = P(X ≤ x), where X is the random variable and x is the threshold value

From our introduction earlier, we understand that the CDF represents the probability that a random variable X (daily admissions) is less than or equal to a given threshold x (e.g., 150 admissions).

By definition, the CDF always shows the probability of X being less than or equal to x (as seen in the equation above), which is a crucial concept.

Now, applying this understanding to calculate the CDF for the maximum daily admissions (Xₘₐₓ), we make the following observation: The value of Xₘₐₓ​ will be less than or equal to x if, and only if, all daily admissions over "n" days are also less than or equal to x.

CDF for the maximum of X: Fₘₐₓ(x) = P(Xₘₐₓ ≤ x) = P(X₁ ≤ x, X₂ ≤ x, …, Xₙ ≤ x) = [Fₓ(x)]ⁿ

For example, consider a threshold of 150 admissions. If we say that the maximum daily admissions (Xₘₐₓ​) must be less than or equal to 150, it implies that every single day in the 30-day period must have daily admissions less than or equal to 150.

You might wonder… Why [ Fₓ(x) ]ⁿ?

This formula might seem abstract at first, but it's quite intuitive once you look at it from a wider view.

  • F(x): Probability that admissions on a single day are at or below x,
  • [F(x)]ⁿ: Probability that admissions are at or below x on all n days (multiplying the daily probabilities assumes the days are independent and share the same distribution).

For example, if F(x) is 0.9 (90% probability of being at or below x), then the chance of this happening across, let's say, 5 days is (0.9)⁵, or approximately 59.05%.
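
If you'd like to convince yourself of the [F(x)]ⁿ rule, a quick simulation helps. The sketch below assumes independent, normally distributed days with the parameters from this example (μ = 100, σ = 20, n = 30). The number of simulated periods and the random seed are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 100, 20, 30   # parameters from the example
x = 150                      # threshold from the example

# Simulate many independent 30-day periods (100,000 is an arbitrary choice)
rng = np.random.default_rng(seed=0)
daily_admissions = rng.normal(mu, sigma, size=(100_000, n))
period_max = daily_admissions.max(axis=1)

# Share of simulated periods whose maximum stayed at or below x
simulated = np.mean(period_max <= x)

# Closed-form result: [F(x)]^n
analytic = norm.cdf(x, mu, sigma) ** n

print(f"Simulated P(Xmax <= {x}): {simulated:.4f}")
print(f"Analytic  [F(x)]^n:       {analytic:.4f}")
```

The two numbers should agree up to simulation noise. If real admissions were correlated from day to day, the independence assumption would break and the formula would only be an approximation.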

3. Calculating the Distribution of Xₘₐₓ

We have the CDF now… what can we do to get our distribution (PDF)? Well… we know that the distribution (PDF) can be obtained by differentiating its CDF! So, we can get the distribution of Xₘₐₓ:

Distribution (PDF) of Xₘₐₓ: fₘₐₓ(x) = d/dx [F(x)]ⁿ = n · [F(x)]ⁿ⁻¹ · f(x)
  • F(x): CDF of daily admissions
  • f(x): PDF of daily admissions
Figure: Distribution (PDF) of Xₘₐₓ, the maximum daily hospital admissions over 30 days, assuming daily admissions follow a normal distribution with μ=100 and σ=20.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters for the normal distribution of daily admissions
mu = 100  # Mean daily hospital admissions
sigma = 20  # Standard deviation of daily admissions
n = 30  # Sample size (days in a month)

# Generate arbitrary x values for the PDF and CDF
x_values = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)
pdf_X = norm.pdf(x_values, mu, sigma)  # PDF of X
cdf_X = norm.cdf(x_values, mu, sigma)  # CDF of X

# PDF and CDF of Xmax
pdf_Xmax = n * (cdf_X**(n - 1)) * pdf_X
cdf_Xmax = cdf_X**n

# Plot the PDF of Xmax
plt.figure(figsize=(10, 6))
plt.plot(x_values, pdf_Xmax, label=f"PDF of $X_{{max}}$ (n={n})")
plt.axvline(mu, color='red', linestyle='--', label='Mean of X')
plt.title("PDF of $X_{max}$ (Maximum Daily Admissions)")
plt.xlabel("Daily Hospital Admissions")
plt.ylabel("Probability Density")
plt.legend()
plt.grid()
plt.show()

This equation gives us the probability distribution for the maximum daily hospital admissions over a 30-day period, enabling hospitals to anticipate and prepare for the peak demand during the outbreak.
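
As a sanity check on the differentiation step, the sketch below differentiates the CDF of Xₘₐₓ numerically and compares it with the closed-form PDF, then confirms the PDF integrates to roughly 1. The grid range and resolution are my own choices, and I assume the same normal model as above.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 100, 20, 30

# Fine grid that covers essentially all of the probability mass of Xmax
x = np.linspace(mu - 4 * sigma, mu + 6 * sigma, 20001)

cdf_max = norm.cdf(x, mu, sigma) ** n
pdf_max = n * norm.cdf(x, mu, sigma) ** (n - 1) * norm.pdf(x, mu, sigma)

# Numerical derivative of the CDF should match the closed-form PDF
pdf_numeric = np.gradient(cdf_max, x)
max_abs_diff = np.max(np.abs(pdf_numeric - pdf_max))
print(f"Largest gap between numeric and closed-form PDF: {max_abs_diff:.2e}")

# A valid PDF integrates to 1 (trapezoidal rule on the grid)
area = np.sum((pdf_max[1:] + pdf_max[:-1]) / 2 * np.diff(x))
print(f"Area under the PDF of Xmax: {area:.6f}")
```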

4. What can we derive from this insight?

Just by looking at this distribution, we can see that it is shifted toward higher values (to the right) compared with the original normal distribution, which is centered around the mean.

The peak of Xₘₐₓ occurs at a higher value than the average, reflecting that the maximum daily admissions over the period is likely to exceed the mean daily admissions that we assumed.
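
To put numbers on this, here is a rough sketch that locates the peak (mode) of Xₘₐₓ, approximates its expected value, and computes a central 90% range. It assumes the same normal model (μ = 100, σ = 20, n = 30). The grid and the 5%/95% cut-offs are my own choices.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 100, 20, 30

x = np.linspace(mu - 4 * sigma, mu + 6 * sigma, 20001)
dx = x[1] - x[0]
pdf_max = n * norm.cdf(x, mu, sigma) ** (n - 1) * norm.pdf(x, mu, sigma)

# Most likely value of the maximum: where the PDF peaks
mode_max = x[np.argmax(pdf_max)]

# Expected maximum: integral of x * f_max(x), approximated by a Riemann sum
mean_max = np.sum(x * pdf_max) * dx

# Quantiles by inverting the CDF: F(x)^n = q  <=>  F(x) = q^(1/n)
lower = norm.ppf(0.05 ** (1 / n), mu, sigma)
upper = norm.ppf(0.95 ** (1 / n), mu, sigma)

print(f"Mode of Xmax:      {mode_max:.1f}")
print(f"Expected Xmax:     {mean_max:.1f}")
print(f"Central 90% range: {lower:.1f} to {upper:.1f}")
```

These three numbers correspond to the insights listed earlier: the most likely range of extreme values and the expected maximum in repeated scenarios.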

5. Then what can we do?

With this distribution and the CDF that we calculated, we can compute probabilities for specific scenarios, such as the chance that Xₘₐₓ exceeds a certain value (e.g., 150 patients/day).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters for the normal distribution of daily admissions
mu = 100  # Mean daily hospital admissions
sigma = 20  # Standard deviation of daily admissions
n = 30  # Sample size (days in a month)
threshold = 150  # Threshold for visualization

# Probability that Xmax > threshold
prob_exceeds_threshold = 1 - (norm.cdf(threshold, mu, sigma) ** n)
print(prob_exceeds_threshold)  # approximately 0.1704

# ------------------------------------------------------------ #

# Generate x values for visualization
x_values = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)
pdf_X = norm.pdf(x_values, mu, sigma)  # PDF of X
cdf_X = norm.cdf(x_values, mu, sigma)  # CDF of X
pdf_Xmax = n * (cdf_X**(n - 1)) * pdf_X  # PDF of Xmax

# Highlight area where Xmax > threshold
x_highlight = np.linspace(threshold, x_values[-1], 500)
pdf_highlight = n * (norm.cdf(x_highlight, mu, sigma)**(n - 1)) * norm.pdf(x_highlight, mu, sigma)

# Plot the PDF of Xmax
plt.figure(figsize=(10, 6))
plt.plot(x_values, pdf_Xmax, label=f"PDF of $X_{{max}}$ (n={n})", color='blue')
plt.fill_between(x_highlight, pdf_highlight, color='orange', alpha=0.6, label=f"Area where $X_{{max}} > {threshold}$")
plt.axvline(threshold, color='green', linestyle='--', label=f"Threshold = {threshold}")
plt.title("PDF of $X_{max}$ with Highlighted Area for $X_{max} > 150$")
plt.xlabel("Daily Hospital Admissions")
plt.ylabel("Probability Density")
plt.legend()
plt.grid()
plt.show()

The probability that the maximum daily hospital admissions (Xₘₐₓ) over 30 days exceeds 150 is approximately 17.04%.

This means there is a significant chance that hospital admissions could surpass 150 patients on at least one day in the 30-day period, which would be critical for resource planning.
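
We can also flip the question around: instead of asking how likely it is to exceed 150, ask how much daily capacity would keep the risk below a chosen level. The sketch below does that by inverting the CDF of Xₘₐₓ. The 5% risk tolerance is a hypothetical number I picked, and the normal model is the same assumption as before.

```python
from scipy.stats import norm

mu, sigma, n = 100, 20, 30
target_risk = 0.05  # hypothetical tolerance: 5% chance of exceeding capacity

# P(Xmax > c) = 1 - F(c)^n = target_risk  <=>  F(c) = (1 - target_risk)^(1/n)
capacity = norm.ppf((1 - target_risk) ** (1 / n), mu, sigma)

# Plug the capacity back in to confirm the risk it implies
achieved_risk = 1 - norm.cdf(capacity, mu, sigma) ** n

print(f"Daily capacity needed for {target_risk:.0%} risk: {capacity:.1f} patients")
print(f"Implied probability of exceeding it: {achieved_risk:.4f}")
```

Under these assumptions, the required capacity lands above 150, which fits with the 17.04% exceedance probability we found for 150.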


Do you see how essential probability distributions are? They truly lay the groundwork for addressing a wide range of problems that more advanced statistical methods can build upon.

We haven't tackled Xₘᵢₙ, but I'll leave that as a challenge for you! Feel free to solve it on your own and share your thoughts in the comments. Of course, I'll provide the solution (equation) as well, but I highly encourage you to pause here, work through it first, and then continue reading!


What is the distribution of Xₘᵢₙ​?

The distribution of Xₘᵢₙ reflects the probability of observing the smallest value in a sample. Similar to Xₘₐₓ, its cumulative distribution function (CDF) is derived from the complement: Xₘᵢₙ is greater than x if, and only if, all n values in the sample are greater than x, which happens with probability [1 − F(x)]ⁿ. The formula is:

CDF of Xₘᵢₙ: Fₘᵢₙ(x) = P(Xₘᵢₙ ≤ x) = 1 − [1 − F(x)]ⁿ

Of course, now we can find the PDF by taking the derivative:

PDF of Xₘᵢₙ: fₘᵢₙ(x) = n · [1 − F(x)]ⁿ⁻¹ · f(x)
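
Here is a minimal sketch of the minimum in code, checked against a simulation. I reuse the normal model from the admissions example (μ = 100, σ = 20, n = 30) only for illustration. The threshold of 60 and the helper names cdf_min and pdf_min are hypothetical choices of mine.

```python
import numpy as np
from scipy.stats import norm

mu, sigma, n = 100, 20, 30
threshold = 60  # hypothetical low threshold

def cdf_min(x):
    """CDF of the minimum: 1 - [1 - F(x)]^n."""
    return 1 - (1 - norm.cdf(x, mu, sigma)) ** n

def pdf_min(x):
    """PDF of the minimum: n * [1 - F(x)]^(n-1) * f(x)."""
    return n * (1 - norm.cdf(x, mu, sigma)) ** (n - 1) * norm.pdf(x, mu, sigma)

# Probability that the smallest daily value falls at or below the threshold
analytic = cdf_min(threshold)

# Monte Carlo check with independent simulated 30-day periods
rng = np.random.default_rng(seed=0)
period_min = rng.normal(mu, sigma, size=(100_000, n)).min(axis=1)
simulated = np.mean(period_min <= threshold)

print(f"Analytic  P(Xmin <= {threshold}): {analytic:.4f}")
print(f"Simulated P(Xmin <= {threshold}): {simulated:.4f}")
print(f"PDF of Xmin at {threshold}: {pdf_min(threshold):.5f}")
```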
Photo by Levi Meir Clancy on Unsplash

Real-life applications with COVID-19

During the COVID-19 pandemic, hospitals faced unprecedented pressure on resources such as ICU beds, ventilators, and healthcare personnel. Assessing the minimum availability of medical resources was critical for planning and ensuring patient care during the worst phases of the outbreak.

Much like the sudden influx of patients that led us to examine the daily maximum admissions, a possible shortage of equipment was an issue as well. Calculating the PDF and CDF of the minimum availability of resources allowed decision-makers to:

  1. Move ventilators to at-risk hospitals.
  2. Alert state authorities for additional resources.
  3. Prioritize cases for ventilator use to maximize survival rates.

Summary

Using the COVID-19 pandemic as a case study, I showed you how understanding the distribution of Xₘₐₓ enables decision-makers to anticipate and plan for worst-case scenarios, like extreme surges in hospital admissions.

Similarly, analyzing the distribution of Xₘᵢₙ highlights resource shortages, such as ICU beds or ventilators, emphasizing the importance of minimum availability in critical planning.

So, why are probability distributions so important?

To me, probability distributions are essential because they form the foundation of data-driven decision-making. They describe the likelihood of various outcomes, enabling data scientists to quantify uncertainty, identify patterns, and make informed predictions.

In short, probability distributions are the backbone of understanding, modeling, and solving real-world problems using data.

I hope you were able to learn something!


Connect with me!

If you made it this far, I assume you are an aspiring data scientist, a teacher in the Data Science field, a professional looking to hone your craft, or just an avid learner in a different field! I would love to have a chat with you on anything!

For those wondering about my images: Unless otherwise noted, all images are by the author (myself)

Sunghyun Ahn – Medium
