What Statistics To Learn For Data Science
Let's be honest, maths, especially statistics, can be quite scary.
In one of my previous posts, I discussed the mathematics you need to become a high-caliber data scientist. In a nutshell, you need to know three key areas: Linear Algebra, Calculus, and Statistics.
Of the three, statistics is the most useful and important to grasp fully. Statistics is the backbone of many data science principles: you will use it every single day, and even machine learning grew out of statistical learning theory.
I want to dedicate a whole post to a detailed roadmap of the statistics knowledge you should have as a data scientist, along with resources for learning it all.
Obviously, statistics is a massive field, and you can't learn everything about it, especially with all the active research going on. However, if you have a solid working knowledge of the topics I will go over in this article, then you are in a very strong position.
If you want a full view of the field, this Wikipedia article summarises the whole statistics landscape.
Summary Statistics
Wikipedia defines a statistic as
"A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose."
In other words, a statistic summarises information about a given sample or population. So, the first thing a budding data scientist should know is the different summary statistics used to describe data.
Summary statistics generally measure four things: location, spread, shape, and dependence. Below is a list of the key ones you should know:
- Mean, Mode, and Median.
- Variance, Standard Deviation, and Coefficient of Variation.
- Skewness and Kurtosis.
- Percentiles, Quartiles and Interquartile Range.
- Spearman's and Pearson's Correlation Coefficient.
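As a quick sketch, here is how these summary statistics can be computed in Python with NumPy and SciPy (the sample values below are made up purely for illustration):

```python
# Computing the key summary statistics on a small made-up sample.
import numpy as np
from scipy import stats

sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 13.0])

mean = sample.mean()                      # location
median = np.median(sample)                # location, robust to outliers
std = sample.std(ddof=1)                  # spread (sample standard deviation)
cv = std / mean                           # coefficient of variation
skew = stats.skew(sample)                 # shape: asymmetry
kurt = stats.kurtosis(sample)             # shape: tail heaviness (excess)
q1, q3 = np.percentile(sample, [25, 75])  # quartiles
iqr = q3 - q1                             # interquartile range

print(f"mean={mean:.2f}, median={median}, std={std:.2f}, IQR={iqr}")
```

Note the `ddof=1` in the standard deviation: that gives the *sample* standard deviation rather than the population one, which matters when your data is a sample.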
Most of the things above are taught at school in most countries, so you are probably familiar with them already! If not, don't worry, they won't take much time to learn and understand.
Visualisations
Data scientists must be able to visually present their data and findings. Therefore, you should be aware of all the different types of plots to know the best way to present your results to stakeholders and colleagues.
The key ones to know are:
- Bar Chart
- Line Graph
- Pie Chart
- Scatter Plot
- Violin Plots
- Histogram
- Frequency Diagrams
- Box and Whisker Plot
- Heat-maps and Contours
You can do fancier things by combining these, like stacked bar charts, but once you know the basics, those are straightforward to produce.
It will also be useful to learn how to produce these plots in Python using packages like Plotly and Matplotlib.
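For instance, a couple of the plots above can be produced in Matplotlib in just a few lines (the data here is randomly generated for the sake of the example):

```python
# A histogram and a box-and-whisker plot of the same made-up data.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)  # 500 made-up observations

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=30)
ax1.set_title("Histogram")
ax2.boxplot(data)
ax2.set_title("Box and Whisker")
fig.savefig("example_plots.png")
```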
Probability Distributions
When people think of statistics, they will probably think of distributions. Probability distributions help us describe statistical events and understand the frequency and probability of specific outcomes from these events.
Their primary use in data science is to help us understand the relationship between our target variable and its explanatory variables (commonly known as features). This helps us choose the most suitable model for the task to maximize performance.
The most important ones to know, along with the two functions used to describe any distribution, are:
- Normal Distribution
- Poisson Distribution
- Binomial Distribution
- Gamma Distribution
- Exponential Distribution
- T-Distribution
- Chi-Square Distribution
- Probability Density Function
- Cumulative Distribution Function
There are many, and I mean many, probability distributions out there. However, based on my personal experience, the ones listed above are the ones you will come across and use most frequently, particularly if you are working in the insurance industry!
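The `scipy.stats` module implements all of the distributions above. A minimal sketch of evaluating a PDF/PMF and CDF and checking a couple of well-known values:

```python
# Working with probability distributions in SciPy.
from scipy import stats

norm = stats.norm(loc=0, scale=1)   # standard normal distribution
pois = stats.poisson(mu=3)          # Poisson distribution with rate 3

cdf_at_mean = norm.cdf(0)           # 0.5, by symmetry of the normal
p_two_events = pois.pmf(2)          # P(X = 2) for a Poisson(3) variable
right_tail = 1 - norm.cdf(1.96)     # ~0.025, the familiar two-tailed 5% cut-off

print(cdf_at_mean, p_two_events, right_tail)
```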
Probability Theory
Probability theory encompasses the whole mathematical modeling side of how probability works. This large area is sometimes treated as separate from statistics due to its size, and it overlaps with several of the topics discussed elsewhere in this article.
In fact, the probability distributions section technically falls inside probability theory, but I gave it its own section because of how important it is.
Anyway, the key things to know about probability theory are:
- Random Variables
- State Spaces
- Samples, Populations, and Standard Error
- Central Limit Theorem
- Law of Large Numbers
- Maximum Likelihood Estimation (very important for Machine Learning, as it explains where loss functions come from)
Probability theory is a big field, but the list above is particularly valuable information for data science and machine learning.
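The Law of Large Numbers and the Central Limit Theorem are easiest to see by simulation. A small sketch, using an exponential distribution purely as an example of a skewed parent distribution:

```python
# Sample means of a skewed distribution settle near the true mean (LLN)
# and their spread shrinks like 1/sqrt(n) (CLT).
import numpy as np

rng = np.random.default_rng(0)
true_mean = 1.0  # mean of an Exponential(1) distribution

# 10,000 sample means, each computed from n = 200 draws
draws = rng.exponential(scale=true_mean, size=(10_000, 200))
sample_means = draws.mean(axis=1)

print(sample_means.mean())  # close to 1.0 (Law of Large Numbers)
print(sample_means.std())   # close to 1/sqrt(200) ~ 0.071 (CLT)
```

Plotting a histogram of `sample_means` would also show the roughly bell-shaped curve the CLT predicts, even though the parent distribution is heavily skewed.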
Hypothesis Testing
How do you know if a result is significant or just some random noise?
Well, statistical testing, commonly known as hypothesis testing, is used to determine this. Perhaps the most famous example in the data world is the A/B test, which is used constantly by companies nowadays.
To fully understand hypothesis testing, you need to know the following concepts:
- Confidence and Prediction Intervals
- Significance Levels, Critical Values, and P-values
- One-Tailed vs Two-Tailed Tests
- Null and Alternative Hypotheses
- Test Statistics
- Z-Test
- T-Test
- Chi-Square Tests (both goodness of fit and independence)
- ANOVA Test
Just learn the process and intuition behind how hypothesis testing is carried out. You also should understand when to use certain tests over others in different scenarios.
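As an illustrative sketch, here is a two-sample t-test in SciPy on made-up A/B test data (the group names and numbers are invented for the example):

```python
# A two-sample t-test, mimicking a simple A/B test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)  # control group
group_b = rng.normal(loc=11.0, scale=2.0, size=100)  # treatment group

# Null hypothesis: the two groups share the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level chosen before the test
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null")
```

The key habit to build is fixing the significance level *before* running the test, then comparing the p-value against it, rather than picking a threshold after seeing the result.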
Regression Analysis
The first algorithm a data scientist typically learns about is linear regression. Despite machine learning being a relatively new field, linear regression dates back to the early 1800s, so it is quite an old statistical technique.
Linear regression is part of a broader area called regression analysis, which is all about estimating the relationship between a target variable and a set of features (known as covariates).
Many models and methods in regression analysis are still used today and are very effective.
As well as regular linear regression, you should learn the following:
- Multivariate Linear Regression
- Polynomial Regression
- Generalized Linear Models
- Generalized Additive Models
- Gauss-Markov Theorem
- Ordinary Least Squares Estimation
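To make ordinary least squares concrete, here is a minimal sketch that recovers known coefficients from noiseless synthetic data, using NumPy's least-squares solver:

```python
# OLS on synthetic data: since y = 2 + 3x exactly (no noise),
# the estimator recovers the true intercept and slope.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solves the least-squares problem min ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(intercept, slope)
```

With real, noisy data the estimates would only be close to the true values, and the Gauss-Markov theorem is what tells you OLS is the best linear unbiased estimator under its assumptions.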
Bayesian Statistics
There are two main branches and ways of thinking in statistics: frequentist and Bayesian. Most people "do" statistics in a frequentist framework, as that's the way most people are taught in school. It's also a more accessible system to work with.
However, humans arguably tend to reason in a Bayesian way, and this view of statistics has been applied successfully in many data science experiments. Many optimization algorithms in machine learning, such as Bayesian optimization for hyperparameter tuning, rely on a Bayesian approach instead of a frequentist one.
I believe Bayesian statistics is an important domain to learn, as many people use the term "Bayesian" loosely and most likely don't know what it truly means.
The key things you should cover are:
- Marginal, Joint, and Conditional Probability
- Bayes' Theorem
- Bayes' Factor
- Conjugate Priors
- Bayesian Updating
- Credible Intervals
- Bayesian Regression
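Bayesian updating with a conjugate prior is simple enough to sketch in a few lines. Here, a Beta prior on a coin's probability of heads is updated after some made-up flips:

```python
# Beta-Binomial conjugate updating: prior + observed counts = posterior.
from scipy import stats

# Beta(1, 1) prior is uniform: no initial opinion about the coin
alpha_prior, beta_prior = 1.0, 1.0

# Observe 7 heads and 3 tails in 10 flips (made-up data)
heads, tails = 7, 3

# Conjugacy makes the posterior another Beta: simply add the counts
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

posterior_mean = alpha_post / (alpha_post + beta_post)        # 8/12
ci_low, ci_high = stats.beta(alpha_post, beta_post).interval(0.95)

print(posterior_mean, (ci_low, ci_high))  # point estimate + credible interval
```

The interval at the end is a *credible* interval, which is the Bayesian counterpart of the frequentist confidence interval and has the more intuitive interpretation: a 95% probability that the parameter lies inside it, given the data and prior.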
Stochastic Processes
This area is optional for an entry-level position, but you will likely use at least one of these topics at some point in your data science career. So, it is well worth knowing about if you have the time in your learning roadmap.
A stochastic process is a collection of random variables, typically indexed by time. Stochastic processes model many phenomena, from the movement of water molecules to stock market prices.
The following areas are the fundamentals of stochastic processes:
- Markov Property
- Markov Chains
- Hidden Markov Models
- Random Walks
- Geometric Brownian Motion
- Ito Calculus (This is quite advanced)
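A simple random walk is the easiest stochastic process to play with. A minimal simulation sketch:

```python
# A simple symmetric random walk: each step is +1 or -1 with equal probability.
import numpy as np

rng = np.random.default_rng(7)
n_steps = 1_000

steps = rng.choice([-1, 1], size=n_steps)  # i.i.d. random steps
walk = np.cumsum(steps)                    # position over "time"

# The walk has the Markov property: the next position depends only on
# the current one, not on how the walk got there.
print(walk[-1])            # final position
print(np.abs(walk).max())  # furthest excursion from the start
```

Swapping the discrete steps for normally distributed increments and exponentiating gives you geometric Brownian motion, the classic toy model for stock prices.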
Resources
Now, there are endless resources to learn about all the topics I listed above. There are even volumes of textbooks for each sub-domain. However, we just need to know the general gist for entry-level data science positions and the main intuition behind them. We certainly don't need a PhD level of understanding (although that would certainly help!).
Textbooks
In terms of textbooks, my go-to would be Practical Statistics for Data Scientists, which covers everything above, is specifically designed for data scientists, and gives concrete examples using Python, which is very useful.
Another great and more advanced option would be An Introduction to Statistical Learning, although it is pretty mathematically dense.
For Bayesian statistics, I recommend Think Bayes by Allen Downey. This is the book I used to learn Bayesian statistics, and it's indeed a treasure trove – not to mention free online!
Courses
If courses are more your cup of tea, there are endless possibilities. I recommend trying the following from Coursera:
Probability & Statistics for Machine Learning & Data Science
It doesn't matter which one you pick as long as it covers a good chunk of the abovementioned things!
Others
There are other valuable platforms out there that you can check out to enhance your learning.
- W3Schools
- Brilliant
- Khan Academy
These learning platforms are excellent reference texts to supplement any textbook or course you are working through. They can also be used on their own if you want!
My Posts
I have also written several articles in the above areas. These blogs will give you a good overview of all the concepts in a digestible way!
Summary & Further Thoughts
Statistics is arguably the most essential and most frequently used field within data science. However, it's enormous and may seem overwhelming at the beginning for someone looking to learn the field. For entry-level data scientists, you don't need to know everything to a PhD level or depth of understanding. Having excellent working knowledge and strong intuition is enough, in my opinion. I hope this article has helped you design your statistics learning roadmap and find valuable resources!
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.