Cross Correlation with Two Time Series in Python

Ever wanted to check the degree of synchrony between two concepts over time? Put differently, how does a given concept X correlate with another concept Y when both are observed over the same time interval and at the same frequency? For instance, how does the volume of Google searches for, say, IELTS move in relation to the number of people who actually registered for the exam in the same period?


This is a question about the relationship between two series. However, summarising that relationship with a simple correlation coefficient does not make full use of the data and can hide important information. Why? A simple correlation coefficient, say Pearson's, compares pairs of values from the two series at the same time point: the count of IELTS searches on, say, 13/02/2020 against the count of actual IELTS registrations on that same day, and so on, returning just one correlation coefficient.
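To make the "just one coefficient" point concrete, here is a minimal sketch of a same-week Pearson correlation. The two arrays are hypothetical counts invented for illustration; they are not the article's data.

```python
import numpy as np

# Hypothetical weekly counts, invented purely for illustration.
searches = np.array([52, 61, 58, 70, 66, 75, 80, 78])
registrations = np.array([110, 125, 118, 140, 133, 150, 161, 155])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r,
# computed only from pairs observed in the same week.
r = np.corrcoef(searches, registrations)[0, 1]
print(f"Same-week Pearson r: {r:.3f}")
```

Whatever the value, it is a single number: it says nothing about whether one series tends to move before the other.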

In this article, we will briefly discuss why a single correlation coefficient may not be effective in this scenario. We will then implement a time-lagged cross correlation in Python. Lastly, we will recommend further steps to take based on the goal of your analysis.


Table of Contents:

  • Limitation of correlation in presence of possible lead-lag relationship

  • How to implement cross correlation in Python

  • Interpretations and further steps

Limitation of Simple Correlation

In the presence of a possible lead-lag relationship, estimating the relationship between two time series with a single coefficient leaves a loophole.


What is this loophole? It is possible that a prospective candidate searches for the exam today but registers a week later. It is also possible that a candidate has already registered but later needs information about the exam and searches for it on Google. A single same-day coefficient misses both cases, and a time-lagged cross correlation analysis can account for them.


This method holds one of the series in place, usually the dependent variable, and creates both lags and leads of the second series across the time period before computing a correlation coefficient at each shift. That way you can compare one series against the other at different time offsets, gain a more holistic understanding of how they move together, and describe the lag and lead relationships that the single-coefficient method misses.
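The shifting idea can be sketched directly with array slicing. The helper `lagged_corr` below is a hypothetical name, and the data are synthetic. Unlike the np.correlate approach used later in the article, this sketch recomputes the mean and standard deviation on the overlapping portion at every lag, so the values will not match that approach exactly.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson r between x[t] and y[t - lag].

    lag < 0 pairs x with later values of y (x leads);
    lag > 0 pairs x with earlier values of y (y leads).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag < 0:
        x, y = x[:lag], y[-lag:]
    elif lag > 0:
        x, y = x[lag:], y[:-lag]
    return np.corrcoef(x, y)[0, 1]

# Synthetic example: y repeats x two steps later.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = np.roll(x, 2)

for lag in range(-3, 4):
    print(lag, round(lagged_corr(x, y, lag), 3))
```

In this synthetic case the coefficient at lag -2 is essentially 1, because x leads y by exactly two steps, while the same-time coefficient at lag 0 is far weaker.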


Cross Correlation in Python

We will explore this concept with a simple example dataset. The IELTS search data comes from Google Trends. We will then use the NumPy library to generate random numbers that serve as our hypothetical IELTS registration data. The following sections show the code implementation:


First we import the libraries and the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("IELTS-Nigeria.csv")

Then we convert the week column to a proper date type, generate the random data that will represent the IELTS registration count, and take a look at the data:

df["week"] = pd.to_datetime(df["week"], format ="%d/%m/%Y").dt.date
df["reg_ielts"] = pd.Series(np.random.randint(100, 220, size = len(df)))
df.head()

We can then use line plots to see the trend of the two time series as shown below the code:

fig = plt.figure(figsize = (8, 6))
sns.lineplot(data = df, x = 'week', y ='search_ielts')
sns.lineplot(data = df, x = 'week', y ='reg_ielts')
plt.title('Search vs Registration of IELTS', weight='bold', fontsize = 15)
plt.ylabel('IELTS Search and Registration Count', weight='bold', fontsize = 12)
plt.xlabel('Week', weight='bold', fontsize = 12)
plt.show()

The two series look roughly stationary, that is, they show no obvious trend or striking seasonality. Next, let's get a graphical idea of the relationship between them with the aid of a scatter plot:

fig = plt.figure(figsize = (8, 6))
sns.regplot(data = df, x = 'search_ielts', y ='reg_ielts')
plt.title('Search vs Registration of IELTS', weight='bold', fontsize = 15)
plt.ylabel('Registration Count', weight='bold', fontsize = 12)
plt.xlabel('Search Count', weight='bold', fontsize = 12)
plt.show()

Now we can see there is almost no correlation between the two series when we pair the data at the same time point, which is the case the simple correlation discussed earlier covers. But sometimes two time series move in sync at different time points and, as stated earlier, it is possible that people first search for information about IELTS and then register for it day(s) or even month(s) later. This information is captured only by lagging or, as the case may be, leading the values of one time series against the other. Cross correlation is the method that accounts for that.


However, before we proceed with the actual cross correlation, we need to check that our data meet certain conditions. Since this is just a simple guide, we will focus on one important condition, the assumption of stationarity: the mean and variance of each series are approximately constant and do not change over time. While there are a couple of tests for checking this, we will use the common Augmented Dickey-Fuller (ADF) test.
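Before running a formal test, a rough informal check of "constant mean and variance" is to compare the two halves of a series. This is only a sketch on synthetic data, with a hypothetical helper name `half_stats`; it does not replace a proper stationarity test.

```python
import numpy as np

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    values = np.asarray(series, dtype=float)
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

rng = np.random.default_rng(0)
noise = rng.normal(size=200)            # roughly stationary by construction
trend = noise + 0.05 * np.arange(200)   # mean drifts upward over time

print("noise:", half_stats(noise))
print("trend:", half_stats(trend))
```

For the trending series the second-half mean is clearly larger than the first-half mean, which is the kind of time dependence stationarity rules out.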


Augmented Dickey Fuller Test

The null hypothesis of the ADF test is that the series has a unit root, that is, it is not stationary. So a rejection goes in favor of the alternative hypothesis of stationarity. The following Python code creates a function to run this test, and the result is presented in the table below:

from statsmodels.tsa.stattools import adfuller

def adf_test(timeseries):
    # Run the ADF test, letting AIC choose the number of lags
    dftest = adfuller(timeseries, autolag='AIC')
    result = pd.Series(dftest[0:4], index=['Test Statistic', 'P-value', 'Lags Used', 'No of Observations'])
    # Append the critical values reported by the test
    for key, value in dftest[4].items():
        result['Critical Value (%s)' % key] = value
    return result

# Apply the test to every column except the date column
adf_table = df.drop('week', axis = 1)
adf_table.apply(adf_test, axis = 0)

From the above table, we can see that the p-values of both series are less than 0.05, so we can reject the null hypothesis and treat the two series as stationary. Otherwise, we would first need to detrend the data.
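If a series had failed the test, one common way to remove a trend is first differencing, that is, working with week-over-week changes instead of levels. The sketch below uses a synthetic random walk rather than the article's data, and differencing is only one of several possible detrending choices.

```python
import numpy as np

rng = np.random.default_rng(42)
steps = rng.normal(size=200)
random_walk = np.cumsum(steps)      # non-stationary by construction

# First difference: value at t minus value at t-1 (one observation is lost)
differenced = np.diff(random_walk)

print(len(random_walk), len(differenced))
```

The differenced series would then be passed through the same stationarity test again before computing any cross correlation.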

So now we can go ahead and generate the cross correlation coefficients as shown below:

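from scipy import signal

def ccf_values(series1, series2):
    p = series1
    q = series2
    # Standardise both series; also dividing p by len(p) scales the
    # zero-lag value to the Pearson correlation coefficient
    p = (p - np.mean(p)) / (np.std(p) * len(p))
    q = (q - np.mean(q)) / (np.std(q))
    # 'full' returns one coefficient for every possible shift
    c = np.correlate(p, q, 'full')
    return c

ccf_ielts = ccf_values(df['search_ielts'], df['reg_ielts'])
ccf_ielts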
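As a sanity check on the normalisation, the coefficient at lag 0 should equal the ordinary Pearson correlation. The sketch below restates the function on plain NumPy arrays and uses synthetic data in place of the IELTS columns.

```python
import numpy as np

def ccf_values(series1, series2):
    p = np.asarray(series1, dtype=float)
    q = np.asarray(series2, dtype=float)
    p = (p - p.mean()) / (p.std() * len(p))
    q = (q - q.mean()) / q.std()
    return np.correlate(p, q, "full")

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = rng.normal(size=30)

ccf = ccf_values(x, y)
# In 'full' mode the zero-lag coefficient sits at index len(y) - 1
zero_lag = ccf[len(y) - 1]
pearson = np.corrcoef(x, y)[0, 1]
print(zero_lag, pearson)
```

The two printed numbers agree up to floating-point error. Coefficients at other lags are divided by the full length as well, so they shrink towards zero as the overlap between the shifted series gets shorter.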

Lastly, we will create a list of our lag values and plot the correlation coefficients against them. We will also draw the confidence bounds outside which a correlation coefficient is considered significant.

lags = signal.correlation_lags(len(df['search_ielts']), len(df['reg_ielts']))

def ccf_plot(lags, ccf):
    fig, ax = plt.subplots(figsize=(9, 6))
    ax.plot(lags, ccf)
    # Approximate 95% bounds at +/- 2/sqrt(n); n is hard-coded as 23 here,
    # so set it to the number of observations in your own data
    ax.axhline(-2/np.sqrt(23), color='red', label='95% confidence interval')
    ax.axhline(2/np.sqrt(23), color='red')
    ax.axvline(x = 0, color = 'black', lw = 1)
    ax.axhline(y = 0, color = 'black', lw = 1)
    # Mark the largest positive and negative coefficients
    ax.axhline(y = np.max(ccf), color = 'blue', lw = 1, linestyle='--', label = 'highest +/- correlation')
    ax.axhline(y = np.min(ccf), color = 'blue', lw = 1, linestyle='--')
    ax.set(ylim = [-1, 1])
    ax.set_title('Cross Correlation IELTS Search and Registration Count', weight='bold', fontsize = 15)
    ax.set_ylabel('Correlation Coefficients', weight='bold', fontsize = 12)
    ax.set_xlabel('Time Lags', weight='bold', fontsize = 12)
    plt.legend()
    plt.show()

ccf_plot(lags, ccf_ielts)
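The red lines in the plot sit at roughly plus or minus 2 divided by the square root of the number of observations. The sketch below computes that bound and lists the lags whose coefficients fall outside it. It assumes n = 23 to match the plotting code, and the coefficient values are invented for illustration.

```python
import numpy as np

n = 23  # assumed number of observations, matching the value in the plot code
bound = 2 / np.sqrt(n)

# Hypothetical coefficients at a handful of lags, invented for illustration.
lags = np.array([-3, -2, -1, 0, 1, 2, 3])
ccf = np.array([0.10, 0.45, -0.05, 0.20, -0.50, 0.30, 0.02])

# Keep only the lags whose coefficient lies outside the band
significant = lags[np.abs(ccf) > bound]
print(f"bound = +/-{bound:.3f}")
print("lags outside the band:", significant)
```

Only coefficients outside the band are usually worth interpreting; those inside it are hard to distinguish from noise.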

Interpretation and Further Steps

First, search_ielts is X, the variable that is being shifted, and reg_ielts is Y. The coefficients to the left of zero are those where X leads and Y lags, while the ones to the right are those where Y leads and X lags.


The highest positive correlation coefficient (the point at which the plot touches the dashed horizontal blue line) is +0.32. Suppose this occurs at lag -8, which pairs each week's search count (X) with the registration count (Y) 8 weeks later. To interpret this, we would say there is a weak positive correlation between people searching for IELTS today and registering 8 weeks later.


On the other hand, if the lag above were positive, we would say that, given the correlation, people tend to check information about IELTS 8 weeks after registering for the exam.
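Reading the peak off the plot can also be done in code. The sketch below builds a synthetic pair in which y echoes x eight steps later, then locates the lag with the highest coefficient using the same normalisation and sign convention as above. The series and the noise level are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
base = rng.normal(size=n + 8)
x = base[8:]                               # x runs 8 steps ahead of y
y = base[:-8] + 0.5 * rng.normal(size=n)   # y echoes x 8 steps later, plus noise

# Same normalisation as ccf_values in the article
p = (x - x.mean()) / (x.std() * n)
q = (y - y.mean()) / y.std()
ccf = np.correlate(p, q, "full")
lags = np.arange(-(n - 1), n)   # lag axis for two equal-length series

best_lag = lags[np.argmax(ccf)]
print("lag with the highest correlation:", best_lag)
```

A negative peak lag such as this one corresponds to the "X leads, Y lags" reading described above; a positive one would mean Y leads.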


Another important step that could be taken after this is to check how significant the correlation coefficients are, and one popular method is the Granger causality test, which checks whether past values of one series help predict the other. While a significant result does not in itself establish causation, it gives a better indication of how the two variables relate to each other.


Thanks for reading.

You can follow me on LinkedIn and Twitter