Correlation and Covariance

abdelrahman.shaban7000
Feb 28, 2022
Correlation and covariance come up in almost any discussion of statistics. In this article, we go through both concepts and illustrate each with some examples.

Covariance:

Covariance measures how two variables vary together, that is, the direction of their linear relationship. Its value can range from -∞ to +∞: a positive value indicates a positive relationship and a negative value indicates a negative relationship. If the two variables are independent, their covariance is 0, although a covariance of 0 does not by itself mean that the variables are independent. A positive covariance tells us that the variables tend to move in the same direction, but it does not mean that one variable depends on the other.

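Its formula (sample covariance, which is what np.cov computes by default):

cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (N − 1)

Dividing by N instead of N − 1 gives the population covariance. In the formula: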

· xi = data value of x

· yi = data value of y

· x̄ = mean of x

· ȳ = mean of y

· N = number of data values.

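It is important to mention that the covariance is affected by the variances (and therefore the units) of the variables, so its magnitude alone does not tell us how strong the relationship is.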

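import numpy as np
x = [2, 4, 6, 9, 14]
y = [32, 64, 72, 81, 172]
# np.cov returns the 2x2 covariance matrix; entry [0][1] is cov(x, y)
np.cov(x, y)[0][1]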

235.5
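To connect this result with the formula, here is a minimal sketch that computes the covariance by hand for the same data. The variable names (dx, dy, sample_cov, population_cov) are only illustrative. It assumes the N − 1 denominator, which is NumPy's default, and also shows the N denominator for comparison.

```python
import numpy as np

x = np.array([2, 4, 6, 9, 14], dtype=float)
y = np.array([32, 64, 72, 81, 172], dtype=float)

n = len(x)
dx = x - x.mean()  # deviations of x from its mean
dy = y - y.mean()  # deviations of y from its mean

# Sum of the products of deviations, divided by N - 1 (sample) or N (population)
sample_cov = (dx * dy).sum() / (n - 1)
population_cov = (dx * dy).sum() / n

print(sample_cov)      # agrees with np.cov(x, y)[0][1]
print(population_cov)  # agrees with np.cov(x, y, bias=True)[0][1]
```

The sample version reproduces the 235.5 shown above. The population version is smaller because it divides by 5 instead of 4.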

Correlation:

Correlation is used to study the strength of the linear relationship between variables. Unlike covariance, it is normalized to lie between -1 and +1, so its magnitude can be used to compare how strong or weak a relationship is.
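A small sketch of that difference, reusing the data from the covariance example: rescaling one variable changes the covariance but leaves the correlation untouched. The factor of 100 is a made-up unit change, chosen only for illustration.

```python
import numpy as np

x = np.array([2, 4, 6, 9, 14], dtype=float)
y = np.array([32, 64, 72, 81, 172], dtype=float)

# Hypothetical unit change, e.g. metres to centimetres
x_scaled = x * 100

cov_original = np.cov(x, y)[0][1]
cov_scaled = np.cov(x_scaled, y)[0][1]

r_original = np.corrcoef(x, y)[0][1]
r_scaled = np.corrcoef(x_scaled, y)[0][1]

print(cov_original, cov_scaled)  # covariance grows by the same factor of 100
print(r_original, r_scaled)      # correlation stays the same
```

This is why covariance values from different datasets are hard to compare directly, while correlation values are comparable.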

There are different methods that we can use to quantify correlation. We will use the Pearson correlation coefficient.

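Its formula:

r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )

Equivalently, r is the covariance of x and y divided by the product of their standard deviations.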

We can say that two variables are positively correlated when they move in the same direction, and negatively correlated when they move in opposite directions.

[Figure: shapes of relationships between variables and their corresponding correlation values.]
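As a rough illustration of how the shape of a relationship maps to a correlation value, the sketch below builds three made-up datasets: a perfectly rising line, a perfectly falling line, and a rising line with added noise. The slopes, intercepts, noise level, and random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

x = np.arange(10, dtype=float)
rising = 3 * x + 1                          # perfect positive linear relationship
falling = -2 * x + 5                        # perfect negative linear relationship
noisy = x + rng.normal(scale=5.0, size=10)  # same upward trend, blurred by noise

r_rising = np.corrcoef(x, rising)[0][1]
r_falling = np.corrcoef(x, falling)[0][1]
r_noisy = np.corrcoef(x, noisy)[0][1]

print(r_rising)   # essentially +1
print(r_falling)  # essentially -1
print(r_noisy)    # strictly between -1 and +1
```

The more the points scatter around a straight line, the closer the coefficient moves toward 0.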

It is important to remember the common phrase "correlation does not imply causation": two variables can be correlated simply because a third factor affects both of them.
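The third-factor effect can be sketched with simulated data. Here z is a hypothetical hidden factor that drives both x and y; neither x nor y influences the other. The sample size, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# z is a hidden third factor; x and y each depend on z plus their own noise
z = rng.normal(size=1000)
x = z + rng.normal(scale=0.3, size=1000)
y = z + rng.normal(scale=0.3, size=1000)

# x and y are strongly correlated even though neither causes the other
r_xy = np.corrcoef(x, y)[0][1]

# Once the shared factor is subtracted out, little correlation remains
r_residual = np.corrcoef(x - z, y - z)[0][1]

print(r_xy)
print(r_residual)
```

In real data the third factor is usually not observed, which is why a high correlation alone cannot establish causation.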

Notice that you can always compute the correlation, even if the relationship is not linear. So before computing it, we should always check the scatterplot to see whether the variables are linearly related. If they are not, the Pearson correlation coefficient is not an appropriate summary of the relationship.
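A minimal sketch of why the check matters, using a made-up quadratic relationship: y is completely determined by x, yet the Pearson coefficient comes out at zero because the relationship is not linear.

```python
import numpy as np

# Symmetric x values and a purely quadratic relationship
x = np.arange(-5, 6, dtype=float)
y = x ** 2

r = np.corrcoef(x, y)[0][1]
print(r)  # effectively 0, despite y depending entirely on x
```

A scatterplot of these points would show a clear U shape, which the coefficient alone hides.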

Here are some examples in code:

import numpy as np
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
print(x)
print(y)

[10 11 12 13 14 15 16 17 18 19]
[ 2  1  4  5  8 12 18 25 96 48]

r = np.corrcoef(x, y)
print(r)

[[1.         0.75864029]
 [0.75864029 1.        ]]
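The off-diagonal entry of that matrix is the Pearson coefficient. As a sketch, the same value can be reproduced in two ways: directly from the deviations, and as the covariance divided by the product of the standard deviations. The names r_manual and r_from_cov are only illustrative.

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

dx = x - x.mean()
dy = y - y.mean()

# Pearson r from the deviations of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Equivalent form: covariance over the product of standard deviations.
# ddof=1 keeps the standard deviations consistent with np.cov's N - 1 default.
r_from_cov = np.cov(x, y)[0][1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r_manual)
print(r_from_cov)
```

Both values agree with np.corrcoef(x, y)[0][1] up to floating-point rounding.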

Another example using pandas:

import pandas as pd
x_col = pd.Series(range(10, 20))
y_col = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
x_col.corr(y_col)  # Pearson correlation by default

0.7586402890911867

There is a difference between correlation and association: strongly correlated variables are strongly associated, but not vice versa, because variables can be strongly associated in a nonlinear way while having a low correlation.

Resources used: here

That was part of Data Insight's Data Scientist program.
