What Statistics To Learn For Data Science
Let's be honest, maths, especially statistics, can be quite scary.
In one of my previous posts, I discussed the mathematics you need to become a high-caliber data scientist. In a nutshell, you need to know three key areas: Linear Algebra, Calculus, and Statistics.
Of the three, statistics is the most useful and important to grasp fully. Statistics is the backbone of many data science principles: you will use it every single day, and even machine learning grew out of statistical learning theory.
I want to dedicate a whole post to a detailed roadmap of the statistics knowledge you should have as a data scientist, along with resources for learning it all.
Obviously, statistics is a massive field, and you can't learn everything about it, especially with all the active research going on. However, if you have a solid working knowledge of the topics I will go over in this article, then you are in a very strong position.
If you want a full view of the field, this Wikipedia article summarises the whole statistics landscape.
Summary Statistics
Wikipedia defines a statistic as
"A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose."
In other words, a statistic summarises information about a given sample or population. So, the first thing a budding data scientist should know is the different summary statistics used to describe data.
Summary statistics generally measure four things: location, spread, shape, and dependence. Below is a list of the key ones you should know:
- Mean, Mode, and Median.
- Variance, Standard Deviation, and Coefficient of Variation.
- Skewness and Kurtosis.
- Percentiles, Quartiles and Interquartile Range.
- Spearman's and Pearson's Correlation Coefficient.
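As a quick sketch, here is how these summary statistics can be computed in Python with NumPy and SciPy (the sample values below are made up purely for illustration):

```python
# Computing the key summary statistics on a small made-up sample.
import numpy as np
from scipy import stats

sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 13.0])

mean = sample.mean()                      # location
median = np.median(sample)                # location, robust to outliers
std = sample.std(ddof=1)                  # spread (sample standard deviation)
cv = std / mean                           # coefficient of variation
skew = stats.skew(sample)                 # shape: asymmetry
kurt = stats.kurtosis(sample)             # shape: tail heaviness (excess)
q1, q3 = np.percentile(sample, [25, 75])  # quartiles
iqr = q3 - q1                             # interquartile range

print(f"mean={mean:.2f}, median={median}, std={std:.2f}, IQR={iqr}")
```

Note the `ddof=1` in the standard deviation: that gives the *sample* standard deviation rather than the population one, which matters when your data is a sample.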
Most of the things above are taught at school in most countries, so you are probably familiar with them already! If not, don't worry, they won't take much time to learn and understand.
Visualisations
Data scientists must be able to visually present their data and findings. Therefore, you should be aware of all the different types of plots to know the best way to present your results to stakeholders and colleagues.
The key ones to know are:
- Bar Chart
- Line Graph
- Pie Chart
- Scatter Plot
- Violin Plots
- Histogram
- Frequency Diagrams
- Box and Whisker Plot
- Heat-maps and Contours
You can do fancier things by combining these, like stacked bar charts, but once you know the basics, those are straightforward to produce.
It will also be useful to learn how to produce these plots in Python using packages like Plotly and Matplotlib.
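For instance, a couple of the plots above can be produced in Matplotlib in just a few lines (the data here is randomly generated for the sake of the example):

```python
# A histogram and a box-and-whisker plot of the same made-up data.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)  # 500 made-up observations

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30)
ax1.set_title("Histogram")
ax2.boxplot(data)
ax2.set_title("Box and Whisker")
fig.savefig("example_plots.png")
```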
Probability Distributions
When people think of statistics, they will probably think of distributions. Probability distributions help us describe statistical events and understand the frequency and probability of specific outcomes from these events.
Their primary use in data science is to help us understand the relationship between our target variable and its explanatory variables (commonly known as features). This helps us choose the most suitable model for the task to maximize performance.
The most important ones to know, along with the two functions used to describe any distribution, are:
- Normal Distribution
- Poisson Distribution
- Binomial Distribution
- Gamma Distribution
- Exponential Distribution
- T-Distribution
- Chi-Square Distribution
- Probability Density Function
- Cumulative Distribution Function
There are many, and I mean many, probability distributions out there. However, based on my personal experience, the ones listed above are the ones you will come across and use most frequently, particularly if you are working in the insurance industry!
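The `scipy.stats` module implements all of the distributions above. A minimal sketch of evaluating a PDF/PMF and CDF and checking a couple of well-known values:

```python
# Working with probability distributions in SciPy.
from scipy import stats

norm = stats.norm(loc=0, scale=1)   # standard normal distribution
pois = stats.poisson(mu=3)          # Poisson distribution with rate 3

cdf_at_mean = norm.cdf(0)           # 0.5, by symmetry of the normal
p_two_events = pois.pmf(2)          # P(X = 2) for a Poisson(3) variable
right_tail = 1 - norm.cdf(1.96)     # ~0.025, the familiar two-tailed 5% cut-off

print(cdf_at_mean, p_two_events, right_tail)
```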
Probability Theory
Probability theory encompasses the whole mathematical modeling side of how probability works. This large area is sometimes treated as separate from statistics due to its size, and it overlaps with several of the topics discussed elsewhere in this article.
In fact, the probability distributions section technically falls inside probability theory, but I gave it its own section because of how important it is.
Anyway, the key things to know about probability theory are:
- Random Variables
- State Spaces
- Samples, Populations, and Standard Error
- Central Limit Theorem
- Law of Large Numbers
- Maximum Likelihood Estimation (very important for Machine Learning, as it explains where loss functions come from)
Probability theory is a big field, but the list above is particularly valuable information for data science and machine learning.
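The Law of Large Numbers and the Central Limit Theorem are easiest to see by simulation. A small sketch, using an exponential distribution purely as an example of a skewed parent distribution:

```python
# Sample means of a skewed distribution settle near the true mean (LLN)
# and their spread shrinks like 1/sqrt(n) (CLT).
import numpy as np

rng = np.random.default_rng(0)
true_mean = 1.0  # mean of an Exponential(1) distribution

# 10,000 sample means, each computed from n = 200 draws
draws = rng.exponential(scale=true_mean, size=(10_000, 200))
sample_means = draws.mean(axis=1)

print(sample_means.mean())  # close to 1.0 (Law of Large Numbers)
print(sample_means.std())   # close to 1/sqrt(200) ~ 0.071 (CLT)
```

Plotting a histogram of `sample_means` would also show the roughly bell-shaped curve the CLT predicts, even though the parent distribution is heavily skewed.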
Hypothesis Testing
How do you know if a result is significant or just some random noise?
Well, statistical testing, commonly known as hypothesis testing, is used to determine this. Perhaps the most famous example in the data world is the A/B test, which is used constantly by companies nowadays.
To fully understand hypothesis testing, you need to know the following concepts:
- Confidence and Prediction Intervals
- Significance Levels, Critical Values, and P-values
- One-Tailed vs Two-Tailed Tests
- Null and Alternative Hypotheses
- Test Statistics
- Z-Test
- T-Test
- Chi-Square Tests (both goodness of fit and independence)
- ANOVA Test
Just learn the process and intuition behind how hypothesis testing is carried out. You also should understand when to use certain tests over others in different scenarios.
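As an illustrative sketch, here is a two-sample t-test in SciPy on made-up A/B test data (the group names and numbers are invented for the example):

```python
# A two-sample t-test, mimicking a simple A/B test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)  # control group
group_b = rng.normal(loc=11.0, scale=2.0, size=100)  # treatment group

# Null hypothesis: the two groups share the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level chosen before the test
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null")
```

The key habit to build is fixing the significance level *before* running the test, then comparing the p-value against it, rather than picking a threshold after seeing the result.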
Regression Analysis
The first algorithm a data scientist typically learns about is linear regression. Despite machine learning being a relatively new field, linear regression dates back to the early 1800s, so it is quite an old statistical technique.
Linear regression is part of a broader area called regression analysis, which is all about estimating the relationship between a target variable and a set of features (known as covariates).
Many models and methods in regression analysis are still used today and are very effective.
As well as regular linear regression, you should learn the following:
- Multivariate Linear Regression
- Polynomial Regression
- Generalized Linear Models
- Generalized Additive Models
- Gauss-Markov Theorem
- Ordinary Least Squares Estimation
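To make ordinary least squares concrete, here is a minimal sketch that recovers known coefficients from noiseless synthetic data, using NumPy's least-squares solver:

```python
# OLS on synthetic data: since y = 2 + 3x exactly (no noise),
# the estimator recovers the true intercept and slope.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solves the least-squares problem min ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(intercept, slope)
```

With real, noisy data the estimates would only be close to the true values, and the Gauss-Markov theorem is what tells you OLS is the best linear unbiased estimator under its assumptions.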
Bayesian Statistics
There are two main branches and ways of thinking in statistics: frequentist and Bayesian. Most people "do" statistics in a frequentist framework, as that's the way most people are taught in school. It's also a more accessible system to work with.
However, humans arguably tend to reason in a Bayesian way, and this view of statistics has been applied successfully in many data science experiments. Many optimization algorithms in machine learning, such as Bayesian optimization for hyperparameter tuning, rely on a Bayesian approach instead of a frequentist one.
I believe Bayesian statistics is an important domain to learn, as many people use the term "Bayesian" loosely and most likely don't know what it truly means.
The key things you should cover are:
- Marginal, Joint, and Conditional Probability
- Bayes' Theorem
- Bayes' Factor
- Conjugate Priors
- Bayesian Updating
- Credible Intervals
- Bayesian Regression
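Bayesian updating with a conjugate prior is simple enough to sketch in a few lines. Here, a Beta prior on a coin's probability of heads is updated after some made-up flips:

```python
# Beta-Binomial conjugate updating: prior + observed counts = posterior.
from scipy import stats

# Beta(1, 1) prior is uniform: no initial opinion about the coin
alpha_prior, beta_prior = 1.0, 1.0

# Observe 7 heads and 3 tails in 10 flips (made-up data)
heads, tails = 7, 3

# Conjugacy makes the posterior another Beta: simply add the counts
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

posterior_mean = alpha_post / (alpha_post + beta_post)        # 8/12
ci_low, ci_high = stats.beta(alpha_post, beta_post).interval(0.95)

print(posterior_mean, (ci_low, ci_high))  # point estimate + credible interval
```

The interval at the end is a *credible* interval, which is the Bayesian counterpart of the frequentist confidence interval and has the more intuitive interpretation: a 95% probability that the parameter lies inside it, given the data and prior.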
Stochastic Processes
This area is optional for an entry-level position, but you will likely use at least one of these topics at some point in your data science career. So, it is well worth knowing about if you have the time in your learning roadmap.
A stochastic process is a collection of random variables, typically indexed by time. Stochastic processes model many phenomena, from the movement of water molecules to stock market prices.
The following areas are the fundamentals of stochastic processes:
- Markov Property
- Markov Chains
- Hidden Markov Models
- Random Walks
- Geometric Brownian Motion
- Ito Calculus (This is quite advanced)
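A simple random walk is the easiest stochastic process to play with. A minimal simulation sketch:

```python
# A simple symmetric random walk: each step is +1 or -1 with equal probability.
import numpy as np

rng = np.random.default_rng(7)
n_steps = 1_000

steps = rng.choice([-1, 1], size=n_steps)  # i.i.d. random steps
walk = np.cumsum(steps)                    # position over "time"

# The walk has the Markov property: the next position depends only on
# the current one, not on how the walk got there.
print(walk[-1])            # final position
print(np.abs(walk).max())  # furthest excursion from the start
```

Swapping the discrete steps for normally distributed increments and exponentiating gives you geometric Brownian motion, the classic toy model for stock prices.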
Resources
Now, there are endless resources to learn about all the topics I listed above. There are even volumes of textbooks for each sub-domain. However, we just need to know the general gist for entry-level data science positions and the main intuition behind them. We certainly don't need a PhD level of understanding (although that would certainly help!).
Textbooks
In terms of textbooks, my go-to would be Practical Statistics for Data Scientists, which covers everything above, is specifically designed for data scientists, and gives concrete examples using Python, which is very useful.
Another great and more advanced option would be An Introduction to Statistical Learning, although it is pretty mathematically dense.
For Bayesian statistics, I recommend Think Bayes by Allen Downey. This is the book I used to learn Bayesian statistics, and it's indeed a treasure trove – not to mention free online!
Courses
If courses are more your cup of tea, there are endless possibilities. I recommend trying the following from Coursera:
Probability & Statistics for Machine Learning & Data Science
It doesn't matter which one you pick as long as it covers a good chunk of the abovementioned things!
Others
There are other valuable platforms out there that you can check out to enhance your learning.
- W3Schools
- Brilliant
- Khan Academy
These learning platforms are excellent reference texts to supplement any textbook or course you are working through. They can also be used on their own if you want!
My Posts
I have also written several articles in the above areas. These blogs will give you a good overview of all the concepts in a digestible way!
Summary & Further Thoughts
Statistics is arguably the most essential and most frequently used field within data science. However, it's enormous and may seem overwhelming at the beginning for someone looking to learn the field. For entry-level data scientists, you don't need to know everything to a PhD level or depth of understanding. Having excellent working knowledge and strong intuition is enough, in my opinion. I hope this article has helped you design your statistics learning roadmap and find valuable resources!
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.