3 Simple Statistical Methods for Outlier Detection


Outliers. Image by author

As we all know, a big part of a data scientist's job is to clean and preprocess data. A huge part of this involves outlier detection and removal. Large outliers, spikes and bad data can really interfere with training an accurate machine learning model, so it's important that outliers are handled properly.

But data scientists aren't always reaching for machine learning models like isolation forest or local outlier factor to identify outliers. One thing I have learned in my data science career: if a simpler solution works, use it.

I want to provide you with 3 simple statistical solutions for detecting outliers that work pretty well most of the time. I'll also show you how these are done in Python.

1. Z score

The z score, also known as the standard score, is one of the better-known methods of outlier detection. It represents how many standard deviations away from the mean a data point is.

The z score of a given data point in some dataset is calculated as follows:

Z = (x − μ) / σ

Where:

  • Z is the z score value
  • x is the data point
  • μ is the dataset mean
  • σ is the standard deviation

So if you get a z score of 4, this means that data point is 4 standard deviations above the mean. If you get a z score of -4, then it is 4 standard deviations below the mean.
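For example, a data point of 120 in a dataset with mean 100 and standard deviation 5 has a z score of (120 − 100) / 5 = 4.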

In the case of z score, typically any value above 3 or below -3 is considered an outlier. However, this threshold is flexible and can be adjusted by the programmer.

Here's an easy way to calculate the z score of every value in a dataframe column using Python and the scipy.stats package:

from scipy import stats

df["z_score"]=stats.zscore(df["column_of_interest"])

Once you have the z score for every value in your dataset, you can filter out the outliers:

df_clean = df[(df["z_score"] <= 3) & (df["z_score"] >= -3)]
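
Equivalently, you can filter on the absolute value of the score:

df_clean = df[df["z_score"].abs() <= 3]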

One downside to the z score is that, despite being an outlier detection method, it is itself sensitive to outliers. Very large outliers inflate both the mean and the standard deviation, and once those are skewed, smaller (but still relevant) outliers can end up with modest z scores and go undetected.
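
Here is a quick illustration of that masking effect, using a hypothetical toy dataset (not from the article):

import numpy as np
from scipy import stats

# Toy data: values near 10, one moderate outlier (25)
# and one extreme outlier (1000)
data = np.array([9, 10, 10, 11, 9, 10, 11, 10, 25, 1000])

z = stats.zscore(data)
print(z.round(2))
# The extreme point inflates the mean (to about 110) and the
# standard deviation (to about 297), so the moderate outlier 25
# gets a z score of about -0.3 and is not flagged by the usual
# |z| > 3 rule.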

2. IQR

IQR (interquartile range) is more robust than the z score because it uses quartiles, rather than the mean, as its reference points.

To calculate the IQR of a dataset, you first need to find Q1 and Q3. Order the dataset from smallest to largest, then split it into two halves and find the median of each half.

Q1 is the median of the first half, and Q3 is the median of the second half. (The true median, Q2, is the middle of the whole dataset.) The IQR is then Q3 − Q1.
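
For example, in the ordered dataset 1, 3, 5, 7, 9, 11, 13, 15, the first half is 1, 3, 5, 7 (Q1 = 4) and the second half is 9, 11, 13, 15 (Q3 = 12), giving IQR = 12 − 4 = 8.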

Calculating IQR. Image by author

Now that you have your IQR value, it is used as a reference to determine whether other points are outliers. Any point more than 1.5 × IQR below Q1, or more than 1.5 × IQR above Q3, is considered an outlier.

In the example above, where Q1 = 14, Q3 = 27, and IQR = 13, anything above 46.5 (27 + 13 × 1.5) or below −5.5 (14 − 13 × 1.5) would be considered an outlier.

Here is how this would be calculated in Python, using NumPy's percentile function and scipy.stats' iqr function:

from scipy.stats import iqr
import numpy as np

# Get the IQR of the data
iqr_data = iqr(df["column_of_interest"])
# Calculate the reference point for getting out of range values
# (1.5 * IQR)
iqr_lim = 1.5 * iqr_data

# Calculate the lower (Q1 / 25th percentile) and upper
# (Q3 / 75th percentile) quartiles
q1 = np.percentile(df["column_of_interest"], 25)
q3 = np.percentile(df["column_of_interest"], 75)

# Use quartiles and IQR*1.5 to determine upper and lower limits
upper_limit = q3 + iqr_lim
lower_limit = q1 - iqr_lim

# Filter out outliers - those below the lower limit
# or above the upper limit
df_clean_iqr = df[(df["column_of_interest"] >= lower_limit)
                  & (df["column_of_interest"] <= upper_limit)]

As you can see, calculating IQR for a dataset requires a few more steps/lines of code than z score. You also don't get a visible "score" for each data point which tells you just how large of an outlier it is. Here you really only know whether something is considered out of range or not.

However, it is less sensitive to outliers in the dataset because it uses the median, which isn't skewed as easily as the mean.

3. Modified z score

Modified z score combines aspects of both the z score and IQR to create a more robust version of the standard z score. It gives you a score that tells you approximately how "far out" a data point is, while being much less sensitive to outliers.

The main difference between the z score and the modified z score is that the modified z score uses the median, instead of the mean, as its reference point. Since the standard deviation is defined in terms of the mean, the modified z score does not measure exact standard deviations; instead, it approximates them using the median absolute deviation (MAD).

Modified z score = 0.6745 × (xᵢ − x͂) / MAD

Here,

  • 0.6745 is a constant that makes the MAD comparable to the standard deviation (for normally distributed data, the MAD is roughly 0.6745 standard deviations)
  • xi is the data point you are looking at
  • x͂ is the median of the dataset
  • MAD is the median absolute deviation of the dataset

To calculate the median absolute deviation, subtract the median of the dataset from each data point and take the absolute value of each difference. Then take the median of all of these absolute differences.
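
For example, for the dataset 1, 2, 3, 4, 100, the median is 3, the absolute differences are 2, 1, 0, 1, 97, and the MAD (the median of those differences) is 1.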

Typically, with modified z score, values with a score of > 3.5 or < -3.5 are considered outliers.

For a more in-depth look at how the modified z score is calculated, plus a real-world example in Python, see my article on the modified z score:

Modified z-score: A robust and efficient way to detect outliers in Python

Here is how modified z score is implemented in Python:

# This function takes a value and a dataset and returns
# the modified z score for a single value.
def compute_mod_z_score(value,df):
    # Calculate the MAD of the dataset (column of interest)
    med_abs_dev = (np.abs(df["column_of_interest"] - 
                  df["column_of_interest"].median())).median()
    const = 0.6745
    mod_z = (const * (value - df["column_of_interest"].median()) 
            / med_abs_dev)
    return mod_z

# Apply the above function to the entire column to get a modified
# z score for every data point.
df["mod_zscore"]=df["column_of_interest"].apply(compute_mod_z_score,df=df)

The main downside to the modified z score is that it is a bit more complex and harder to explain, since it is less well known and relies on quantities such as the MAD that are also less familiar. There is also no Python library (that I know of) that calculates the modified z score directly.

Conclusion

As you can see, each statistical method of outlier detection has its benefits and drawbacks. At work, I have used all of these, but for different data sets and use cases. I can't stress enough how important it is to explore your data so that you know how to approach your problem.

For example, if your dataset is prone to spikes, modified z score or IQR may be the best option. If you're looking for the simplest and most easily explainable solution, z score / standard score will be the way to go.

Always make sure that you run your own tests, consult with other data scientists, and with any relevant stakeholders who could benefit from your results.


Thanks for reading
