An Introduction to Covariance and Correlation

Introduction
Understanding associations between variables is crucial for building accurate models and making informed decisions. Statistics can be a messy business, full of noise and random variation. Yet, by identifying patterns and connections between variables, we can gain insight into how different features influence one another. For data scientists and data analysts, such associations are exceedingly useful, particularly when it comes to analysis and model building.
With this in mind, covariance and correlation are two fundamental statistical concepts that describe the relationship between variables. Though they are similar in nature, they differ in how each characterizes associations. But, as we'll discover shortly, these differences are more cosmetic than substantive: the two measures are really just different sides of the same coin. So today we'll explore what covariance and correlation are, how they are calculated, and what they mean.
Covariance
To motivate this discussion, suppose we have two random variables X and Y that we're particularly interested in. I'm not going to make any specific assumptions about how they're distributed, except to say that they are jointly distributed according to some function f(x, y). In such cases, it's interesting to consider the extent to which X and Y vary together, and this is precisely what covariance measures: it is a measure of the joint variability of two random variables.
If we treat X and Y as continuous random variables, then the covariance may be expressed as:

\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \, f(x, y) \, dx \, dy
The integrals here make this equation look more intimidating than it actually is: all that's happening is that an average is being computed over the joint space. This becomes clearer if we use the expected value operator, E[⋅], which produces a more palatable expression for the covariance:

\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]
So, what we can see here is that covariance is an expectation (or average) taken over the product of the mean-centered X and Y variables. In fact, this can be simplified even further because expectations have rather nice linearity properties:

\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
We can now see that the covariance is just the mean of the product of the variables minus the product of their means. Also, here's a fun fact: the variance is a special case of covariance and is simply the covariance of a variable with itself:

\mathrm{Var}(X) = \mathrm{Cov}(X, X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2
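If you'd like to convince yourself of these identities, here's a minimal sketch using NumPy (the distributions, seed, and sample size are purely illustrative) that checks the mean-centered form, the "product of means" form, and the variance-as-covariance special case against one another:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two illustrative, related samples
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Covariance as the average product of the mean-centred variables
cov_centred = np.mean((x - x.mean()) * (y - y.mean()))

# Equivalent form: E[XY] - E[X]E[Y]
cov_product = np.mean(x * y) - x.mean() * y.mean()

# Variance as the covariance of a variable with itself
var_as_cov = np.mean((x - x.mean()) ** 2)

print(cov_centred, cov_product)   # the two forms agree
print(var_as_cov, np.var(x))      # matches the (population) variance of x
```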
Fundamentally, covariance is a property of any joint probability distribution and is a population parameter in its own right. If we only have a sample of X and Y, we can estimate it with the sample covariance using the following formula:

s_{XY} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
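As a quick sanity check, the sketch below (using made-up numbers) computes the sample covariance directly from this formula and compares it against NumPy's np.cov, which uses the same n − 1 divisor by default:

```python
import numpy as np

# A small, purely hypothetical sample
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.2, 12.1, 13.5, 15.0])

n = len(x)

# Sample covariance: sum of products of deviations, divided by (n - 1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 covariance matrix; the off-diagonal entry
# is the sample covariance between x and y
print(s_xy, np.cov(x, y)[0, 1])
```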
Okay, but what does covariance mean, in practice?
Simply put, covariance measures the extent to which the values of one variable are related to the values of another, and it can be either positive or negative. A positive covariance indicates that the two variables tend to move in the same direction: if large values of X tend to coincide with large values of Y, the covariance is positive, and the same applies if low values coincide. A negative covariance indicates that the values tend to move in opposite directions: this occurs if large values of X correspond with low values of Y, for example.
A useful property of covariance is that its sign indicates the direction of the linear relationship between X and Y. That said, the units it is expressed in are somewhat less useful: because we are taking products of X and Y, the measure itself is in units of X × Y. This can make comparisons between datasets difficult, because the scale of measurement matters.
Correlation
What we most often refer to as correlation is measured using Pearson's product-moment correlation coefficient, conventionally denoted ρ. Now, if you were thinking that covariance sounds a lot like correlation, you're not wrong. That's because the correlation coefficient is just a normalized version of the covariance, where the normalizing factor is the product of the standard deviations:

\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
We can also estimate the correlation coefficient from data using the following formula:

r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
The upshot of this normalization is that the correlation coefficient can only take values between -1 and 1, with -1 indicating a perfect negative linear relationship, 1 indicating a perfect positive linear relationship, and 0 indicating no linear relationship. In this way, it measures both the strength and the direction of the relationship between two variables. What's nice is that the correlation coefficient is a standardized measure, which means it is agnostic about the scale of the variables involved. This solves an intrinsic issue with covariance, making it much easier to compare relationships between different pairs of variables.
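To make this concrete, here's a short illustrative sketch: it computes the sample correlation from the formula above, compares it against np.corrcoef, and then shows that rescaling one variable (e.g., switching from centimeters to meters) changes the covariance but leaves the correlation untouched. The numbers are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=1_000)   # e.g. heights in cm (illustrative)
y = 0.6 * x + rng.normal(scale=8, size=1_000)   # a related quantity

# Pearson correlation computed directly from the sample formula
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(r, np.corrcoef(x, y)[0, 1])   # the two agree

# Changing units (cm -> m) rescales the covariance but not the correlation
x_m = x / 100
print(np.cov(x, y)[0, 1], np.cov(x_m, y)[0, 1])            # covariance shrinks 100-fold
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_m, y)[0, 1])  # correlation unchanged
```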
However, while the correlation coefficient estimates the strength of a linear relationship, it cannot fully characterize the data. Anscombe's quartet provides a very good example of this, showing how strikingly different patterns in data can yield essentially identical correlation coefficients. Ultimately, Pearson's correlation coefficient only provides a full characterization if the data are bivariate normal. If this is not the case, the correlation coefficient is only indicative and needs to be considered alongside a visual inspection of the data.
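If you want to see this for yourself, seaborn ships a small loader for Anscombe's quartet (it fetches a CSV from seaborn's sample-data repository, so an internet connection is assumed). Something like the following prints the near-identical Pearson correlations of four very different-looking datasets:

```python
import seaborn as sns

# Load Anscombe's quartet as a tidy DataFrame with columns: dataset, x, y
df = sns.load_dataset("anscombe")

# All four datasets share (almost exactly) the same Pearson correlation,
# despite looking completely different when plotted
for name, group in df.groupby("dataset"):
    print(name, group["x"].corr(group["y"]))
```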
Covariance, Correlation, & Independence
Let's suppose that the random variables X and Y are statistically independent. Under the independence assumption, the expected value of the product XY factorizes into the product of the expectations:

\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]
If we plug this into the expression for the covariance, we find that:

\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = \mathbb{E}[X]\,\mathbb{E}[Y] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0
Therefore, random variables that are independent have zero covariance, which further implies that these variables are uncorrelated. However, if we find that two variables are uncorrelated – i.e., they have a correlation coefficient of zero – we cannot necessarily assume that they are independent. Fundamentally, covariance and correlation measure linear dependency, so all we can say is that the variables are not linearly related. It's entirely possible that the variables are non-linearly related, but covariance and correlation cannot detect these types of relationships.
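A quick simulated example, with arbitrary distributions chosen purely for illustration, shows the first half of this in action: two independently generated samples produce a sample covariance and correlation that are both very close to zero.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two variables drawn independently of one another
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

# Both the sample covariance and the correlation are close to zero
print(np.cov(x, y)[0, 1])
print(np.corrcoef(x, y)[0, 1])
```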
To illustrate that zero correlation does not imply independence, we can lean on a classic counterexample that goes as follows. Suppose X is a random variable with some distribution f(x) that is symmetric around zero. This implies that f(-x) = f(x) for all x, which in turn implies that every odd moment of X vanishes:

\mathbb{E}\big[X^{2k+1}\big] = \int_{-\infty}^{\infty} x^{2k+1} f(x) \, dx = 0, \quad k = 0, 1, 2, \ldots
Given this symmetry condition, the expectation of X is therefore:

\mathbb{E}[X] = \int_{-\infty}^{\infty} x f(x) \, dx = 0
If we now create a dependency between X and Y such that Y = X², then we know exactly what Y must be for any given value of X. However, if we examine the covariance between X and Y, we find that:

\mathrm{Cov}(X, Y) = \mathbb{E}[X \cdot X^2] - \mathbb{E}[X]\,\mathbb{E}[X^2] = \mathbb{E}[X^3] - 0 \cdot \mathbb{E}[X^2] = 0
What this demonstrates is that, while X and Y are clearly dependent, the covariance is zero because the relationship is non-linear. There is one special case that you should be aware of, though: if X and Y are jointly normally distributed (not merely each normally distributed on its own), then a correlation coefficient of zero does imply independence.
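We can also verify this counterexample numerically. The sketch below draws X from a distribution that is symmetric around zero and sets Y = X²; despite the perfect (non-linear) dependence, the sample covariance and correlation both come out approximately zero. The seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(123)

# X is symmetric around zero; Y is completely determined by X
x = rng.normal(size=100_000)
y = x ** 2

# Despite the perfect non-linear dependence, both the sample covariance
# and the correlation are approximately zero
print(np.cov(x, y)[0, 1])
print(np.corrcoef(x, y)[0, 1])
```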
Thanks for reading!
If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don't miss out on any new content.
To get unlimited access to all content consider signing up for a Medium subscription.
You can also follow me on Twitter or LinkedIn, or check out my GitHub if that's more your thing.