An Introduction to Covariance and Correlation

Introduction
Understanding associations between variables is crucial for building accurate models and making informed decisions. Statistics can be a messy business, full of noise and random variation. Yet, by identifying patterns and connections between variables, we can gain insight into how different features influence one another. For data scientists and data analysts, such associations are exceedingly useful, particularly when it comes to analysis and model building.
With this in mind, covariance and correlation are two fundamental statistical concepts that describe the relationship between variables. Though they are similar in nature, they differ in how each characterizes associations. But, as we'll discover shortly, these differences are more cosmetic than substantive: the two measures are really just different sides of the same coin. So today we'll explore what covariance and correlation are, how they are calculated, and what they mean.
Covariance
To motivate this discussion, suppose we have two random variables X and Y that we're particularly interested in. I'm not going to make any specific assumptions about how they're distributed, except to say that they are jointly distributed according to some function f(x, y). In such cases, it's interesting to consider the extent to which X and Y vary together, and this is precisely what covariance measures: it is a measure of the joint variability of two random variables.
If we treat X and Y as continuous random variables, then the covariance may be expressed as:

\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \, f(x, y) \, dx \, dy
The integrals here make this equation look more intimidating than it actually is: all that's happening is that an average is being computed over the joint space. This becomes clearer if we use the expected value operator, E[⋅], which produces a more palatable expression for the covariance:

\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]
So, what we can see here is that covariance is an expectation (or average) taken over the product of the mean-centered X and Y variables. In fact, this can be simplified even further because expectations have rather nice linearity properties:

\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
We can now see that the covariance is just the mean of the product of the variables minus the product of their means. Also, here's a fun fact: the variance is a special case of covariance and is simply the covariance of a variable with itself:

\mathrm{Var}(X) = \mathrm{Cov}(X, X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2
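If you'd like to convince yourself of these identities, here's a minimal sketch using NumPy (the distributions, seed, and sample size are purely illustrative) that checks the mean-centered form, the "product of means" form, and the variance-as-covariance special case against one another:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two illustrative, related samples
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Covariance as the average product of the mean-centred variables
cov_centred = np.mean((x - x.mean()) * (y - y.mean()))

# Equivalent form: E[XY] - E[X]E[Y]
cov_product = np.mean(x * y) - x.mean() * y.mean()

# Variance as the covariance of a variable with itself
var_as_cov = np.mean((x - x.mean()) ** 2)

print(cov_centred, cov_product)   # the two forms agree
print(var_as_cov, np.var(x))      # matches the (population) variance of x
```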
Fundamentally, covariance is a property of any joint probability distribution and is a population parameter in its own right. If we only have a sample of X and Y, we can estimate it with the sample covariance using the following formula:

s_{XY} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
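As a quick sanity check, the sketch below (using made-up numbers) computes the sample covariance directly from this formula and compares it against NumPy's np.cov, which uses the same n − 1 divisor by default:

```python
import numpy as np

# A small, purely hypothetical sample
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.2, 12.1, 13.5, 15.0])

n = len(x)

# Sample covariance: sum of products of deviations, divided by (n - 1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal entry
# is the sample covariance between x and y
print(s_xy, np.cov(x, y)[0, 1])
```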
Okay, but what does covariance mean, in practice?
Simply put, covariance measures the extent to which the values of one variable are related to the values of another, and it can be either positive or negative. A positive covariance indicates that the two variables tend to move in the same direction: if large values of X tend to coincide with large values of Y, the covariance is positive, and the same applies if low values coincide. A negative covariance indicates that the values tend to move in opposite directions: this occurs if large values of X correspond with low values of Y, for example.
A useful property of covariance is that its sign indicates the direction of the linear relationship between X and Y. That said, the units it is expressed in are somewhat less useful: because we are taking products of X and Y, the measure itself is in units of X × Y. This can make comparisons between datasets difficult, because the scale of measurement matters.
Correlation
What we most often refer to as correlation is measured using Pearson's product-moment correlation coefficient, conventionally denoted ρ. Now, if you were thinking that covariance sounds a lot like correlation, you're not wrong. That's because the correlation coefficient is just a normalized version of the covariance, where the normalizing factor is the product of the standard deviations:

\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
We can also estimate the correlation coefficient from data using the following formula:

r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
The upshot of this normalization is that the correlation coefficient can only take values between -1 and 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship. In this way, it measures both the strength and the direction of the relationship between two variables. What's nice is that the correlation coefficient is a standardized measure, which means it is agnostic about the scale of the variables involved. This solves an intrinsic issue with covariance, making it much easier to compare relationships between different pairs of variables.
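To make this concrete, here's a short illustrative sketch: it computes the sample correlation from the formula above, compares it against np.corrcoef, and then shows that rescaling one variable (e.g., switching from centimeters to meters) changes the covariance but leaves the correlation untouched. The numbers are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=1_000)   # e.g. heights in cm (illustrative)
y = 0.6 * x + rng.normal(scale=8, size=1_000)   # a related quantity

# Pearson correlation computed directly from the sample formula
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(r, np.corrcoef(x, y)[0, 1])   # the two agree

# Changing units (cm -> m) rescales the covariance but not the correlation
x_m = x / 100
print(np.cov(x, y)[0, 1], np.cov(x_m, y)[0, 1])            # covariance shrinks 100-fold
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_m, y)[0, 1])  # correlation unchanged
```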
However, while the correlation coefficient estimates the strength of a linear relationship, it cannot fully characterize the data. Anscombe's quartet provides a very good example of this, showing how strikingly different patterns in data can yield essentially identical correlation coefficients. Ultimately, Pearson's correlation coefficient only provides a full characterization if the data are bivariate normal. If this is not the case, the correlation coefficient is only indicative and needs to be considered alongside a visual inspection of the data.
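If you want to see this for yourself, seaborn ships a small loader for Anscombe's quartet (it fetches a CSV from seaborn's sample-data repository, so an internet connection is assumed). Something like the following prints the near-identical Pearson correlations of four very different-looking datasets:

```python
import seaborn as sns

# Load Anscombe's quartet as a tidy DataFrame with columns: dataset, x, y
df = sns.load_dataset("anscombe")

# All four datasets share (almost exactly) the same Pearson correlation,
# despite looking completely different when plotted
for name, group in df.groupby("dataset"):
    print(name, group["x"].corr(group["y"]))
```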
Covariance, Correlation, & Independence
Let's suppose that the random variables X and Y are statistically independent. Under the independence assumption, the expected value of the product XY factorizes into the product of the expectations:

\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]
If we plug this into the expression for the covariance, we find that:

\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0
Therefore, random variables that are independent have zero covariance, which further implies that these variables are uncorrelated. However, if we find that two variables are uncorrelated – i.e., they have a correlation coefficient of zero – we cannot necessarily assume that they are independent. Fundamentally, covariance and correlation measure linear dependency, so all we can say is that the variables are not linearly related. It's entirely possible that the variables are non-linearly related, but covariance and correlation cannot detect these types of relationships.
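A quick simulated example, with arbitrary distributions chosen purely for illustration, shows the first half of this in action: two independently generated samples produce a sample covariance and correlation that are both very close to zero.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two variables drawn independently of one another
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

# Both the sample covariance and the correlation are close to zero
print(np.cov(x, y)[0, 1])
print(np.corrcoef(x, y)[0, 1])
```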
To illustrate that zero correlation does not imply independence, we can lean on a classic counterexample that goes as follows. Suppose X is a random variable with some distribution f(x) that is symmetric around zero. This implies that f(-x) = f(x) for all x, which in turn implies that every odd moment of X vanishes:

\mathbb{E}\big[X^{2k+1}\big] = \int_{-\infty}^{\infty} x^{2k+1} f(x) \, dx = 0, \quad k = 0, 1, 2, \ldots
Given this symmetry condition, the expectation of X is therefore:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x f(x) \, dx = 0
If we now create a dependency between X and Y such that Y = X², then we know exactly what Y must be for any given value of X. However, if we examine the covariance between X and Y, we find that:

\mathrm{Cov}(X, Y) = \mathbb{E}[X \cdot X^2] - \mathbb{E}[X]\,\mathbb{E}[X^2] = \mathbb{E}[X^3] - 0 \cdot \mathbb{E}[X^2] = 0
What this demonstrates is that, while X and Y are clearly dependent, the covariance is zero because the relationship is non-linear. There is one special case that you should be aware of, though: if X and Y are jointly normally distributed (not merely each normally distributed on its own), then a correlation coefficient of zero does imply independence.
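We can also verify this counterexample numerically. The sketch below draws X from a distribution that is symmetric around zero and sets Y = X²; despite the perfect (non-linear) dependence, the sample covariance and correlation both come out approximately zero. The seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(123)

# X is symmetric around zero; Y is completely determined by X
x = rng.normal(size=100_000)
y = x ** 2

# Despite the perfect non-linear dependence, both the sample covariance
# and the correlation are approximately zero
print(np.cov(x, y)[0, 1])
print(np.corrcoef(x, y)[0, 1])
```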
Thanks for reading!
If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don't miss out on any new content.
To get unlimited access to all content consider signing up for a Medium subscription.
You can also follow me on Twitter or LinkedIn, or check out my GitHub if that's more your thing.