Computing a Two-Way ANOVA Test with Pandas
In hypothesis testing, ANOVA (Analysis of Variance) is a parametric test that comes in three variants, depending on the data:
One Way ANOVA
Two Way ANOVA
Two Way ANOVA with replication
Each of these tests is used under specific conditions. They share the following assumptions:
Normally distributed data
Equal variances across the groups being compared
To carry out the test properly, we generally follow five steps:
Step 1: Formulate the hypotheses
Step 2: Choose the probability distribution
Step 3: Compute the observed values (the test statistics)
Step 4: Determine the critical values
Step 5: Draw a conclusion
In this post, we'll use Pandas to compute the two-way ANOVA statistics needed in step 3. We'll demonstrate on the following data, which gives the yields of three varieties of maize grown with four different kinds of fertilizer. We want to test whether the variation in yields is caused by the varieties of maize, by the kinds of fertilizer, or by both.
| Variety_1 | Variety_2 | Variety_3 |
Type_1 | 64 | 72 | 74 |
Type_2 | 55 | 57 | 47 |
Type_3 | 59 | 66 | 58 |
Type_4 | 58 | 57 | 53 |
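For readers who want to follow along without an input file, the table above can be typed directly into a DataFrame. This is only a sketch: the variable name data matches the one used in the rest of the post, and the row and column labels are copied from the table.

```python
import pandas as pd

# Yields from the table above: rows are fertilizer types, columns are maize varieties.
data = pd.DataFrame(
    {
        "Variety_1": [64, 55, 59, 58],
        "Variety_2": [72, 57, 66, 57],
        "Variety_3": [74, 47, 58, 53],
    },
    index=["Type_1", "Type_2", "Type_3", "Type_4"],
)

print(data)
```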
Before going further, let's recall the formulas:
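The formulas are not reproduced in the text, so the block below restates them as they are implemented in the complete code of section 3. The notation is an assumption made here for readability: x_ij is the yield in row i and column j, r and c are the numbers of rows and columns, T_i. and T_.j are the row and column totals (the Ti. and T.j of the table in section 1), and T is the grand total.

```latex
\begin{aligned}
CF  &= \frac{T^2}{r\,c}, \qquad T = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij} \\
SST &= \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij}^2 - CF \\
SSR &= \frac{1}{c}\sum_{i=1}^{r} T_{i.}^2 - CF \\
SSC &= \frac{1}{r}\sum_{j=1}^{c} T_{.j}^2 - CF \\
SSE &= SST - SSR - SSC \\
MSR &= \frac{SSR}{r-1}, \qquad MSC = \frac{SSC}{c-1}, \qquad MSE = \frac{SSE}{(r-1)(c-1)} \\
F_r &= \frac{MSR}{MSE}, \qquad F_c = \frac{MSC}{MSE}
\end{aligned}
```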
Now we can start writing our Python code to solve the problem.
1. Correction factor
Before calculating the correction factor (stored in the code as correlation_factor), let's first compute the sum of each column, the sum of each row, and finally the grand total of our data.
NB: The following steps assume that the data has already been loaded into a variable called "data".
| Variety_1 | Variety_2 | Variety_3 | Ti. |
Type_1 | 64 | 72 | 74 | 210 |
Type_2 | 55 | 57 | 47 | 159 |
Type_3 | 59 | 66 | 58 | 183 |
Type_4 | 58 | 57 | 53 | 168 |
T.j | 236 | 252 | 232 | 720 |
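The table with its margins can also be produced in pandas. The sketch below rebuilds the data inline so that it runs on its own; the name table and the use of .loc to append the totals row are choices made for this illustration, not part of the original code.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Variety_1": [64, 55, 59, 58],
        "Variety_2": [72, 57, 66, 57],
        "Variety_3": [74, 47, 58, 53],
    },
    index=["Type_1", "Type_2", "Type_3", "Type_4"],
)

table = data.copy()
# Row totals go into a new column named like the table header.
table["Ti."] = data.sum(axis=1)
# Column totals go into a new row; summing the "Ti." column gives the grand total.
table.loc["T.j"] = table.sum(axis=0)

print(table)
```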
1.1. Sum of columns
variety_sum = data.sum()
Output:
variety_1 236
variety_2 252
variety_3 232
dtype: int64
Called with its default axis (axis=0), the sum method adds the values down each column and returns a Series with one total per column.
1.2. Sum of rows
type_sum = data.sum(axis=1)
Output:
Type_1 210
Type_2 159
Type_3 183
Type_4 168
dtype: int64
Called with axis=1, the sum method adds the values across each row and returns a Series with one total per row.
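Because the axis numbers are easy to mix up, pandas also accepts the names "index" and "columns" for the axis argument. The short sketch below, with the data rebuilt inline and the hypothetical names col_totals and row_totals, shows both spellings side by side.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Variety_1": [64, 55, 59, 58],
        "Variety_2": [72, 57, 66, 57],
        "Variety_3": [74, 47, 58, 53],
    },
    index=["Type_1", "Type_2", "Type_3", "Type_4"],
)

# axis="index" (or 0): collapse the rows, leaving one total per column.
col_totals = data.sum(axis="index")
# axis="columns" (or 1): collapse the columns, leaving one total per row.
row_totals = data.sum(axis="columns")

print(col_totals)
print(row_totals)
```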
With type_sum and variety_sum in hand, we have everything needed to compute the grand total and, from it, the correction factor.
1.3. Sum of all data
Since we already have the row sums and the column sums, the grand total is easy to calculate: summing either one gives the same value.
type_sum.sum()  # or, equivalently, variety_sum.sum()
Output:
720
Because type_sum and variety_sum are one-dimensional Series, calling sum on either of them returns a single value: the sum of its elements.
1.4. Number of columns and rows
We need to store the number of rows and columns of our data for use in the following computations. These values are taken from the DataFrame's shape attribute.
# Number of rows in data
nbre_row = data.shape[0]
# Number of columns in data
nbre_column = data.shape[1]
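1.5. Correction factor
To calculate it, we apply the formula: the square of the grand total divided by the total number of observations. The code stores it in a variable named correlation_factor.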
correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)
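As a quick check on the line above, here is the same calculation done by hand with the totals from the table (grand total 720, 4 rows, 3 columns). CF is shorthand used here for the correction factor.

```latex
CF = \frac{T^2}{r\,c} = \frac{720^2}{4 \times 3} = \frac{518400}{12} = 43200
```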
2. Total sum of squares
sst = (data**2).sum().sum() - correlation_factor
The expression data**2 squares every value in data, (data**2).sum() adds the squared values down each column, and (data**2).sum().sum() adds those column totals together to give the sum of all squared values.
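To make the chained calls easier to follow, the sketch below splits them into intermediate steps on the example data. The names squared, per_column, total_of_squares and correction are introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Variety_1": [64, 55, 59, 58],
        "Variety_2": [72, 57, 66, 57],
        "Variety_3": [74, 47, 58, 53],
    },
    index=["Type_1", "Type_2", "Type_3", "Type_4"],
)

squared = data**2                        # every yield squared, same shape as data
per_column = squared.sum()               # one sum of squares per variety
total_of_squares = per_column.sum()      # 43862

# Correction factor: squared grand total over the number of observations.
correction = data.to_numpy().sum() ** 2 / data.size   # 43200.0

sst = total_of_squares - correction      # 662.0
print(per_column)
print(sst)
```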
3. Complete code
def compute_anova_parameter(data):
    # Sum of each column (one total per variety)
    variety_sum = data.sum()
    # Sum of each row (one total per fertilizer type)
    type_sum = data.sum(axis=1)
    # Number of rows in data
    nbre_row = data.shape[0]
    # Number of columns in data
    nbre_column = data.shape[1]
    # Correction factor: squared grand total divided by the number of observations
    correlation_factor = type_sum.sum()**2/(nbre_column*nbre_row)
    # Total sum of squares
    sst = (data**2).sum().sum() - correlation_factor
    # Sum of squares of the row effect
    ssr = (type_sum**2).sum()/nbre_column - correlation_factor
    # Sum of squares of the column effect
    ssc = (variety_sum**2).sum()/nbre_row - correlation_factor
    # Error sum of squares
    sse = sst-ssc-ssr
    # Mean square of the columns
    msc = ssc/(nbre_column-1)
    # Mean square of the rows
    msr = ssr/(nbre_row-1)
    # Mean square error
    mse = sse/((nbre_column-1)*(nbre_row-1))
    # Fisher statistics, converted to plain floats so they print cleanly
    Fc = round(float(msc/mse), 3)
    Fr = round(float(msr/mse), 3)
    return {"Fc": Fc, "Fr": Fr}
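The function above returns only the two F values. When checking the calculation by hand, it can help to see the intermediate quantities too. The sketch below is one possible variant, not the original author's code: the function name anova_table and the layout of the returned DataFrame are assumptions made for this illustration.

```python
import pandas as pd


def anova_table(data: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA without replication, returned as a summary table."""
    n_rows, n_cols = data.shape
    grand_total = data.to_numpy().sum()
    cf = grand_total**2 / (n_rows * n_cols)            # correction factor

    sst = (data**2).to_numpy().sum() - cf              # total sum of squares
    ssr = (data.sum(axis=1) ** 2).sum() / n_cols - cf  # row effect
    ssc = (data.sum(axis=0) ** 2).sum() / n_rows - cf  # column effect
    sse = sst - ssr - ssc                              # error

    df_r, df_c = n_rows - 1, n_cols - 1
    table = pd.DataFrame(
        {"SS": [ssr, ssc, sse], "df": [df_r, df_c, df_r * df_c]},
        index=["Rows", "Columns", "Error"],
    )
    table["MS"] = table["SS"] / table["df"]
    table["F"] = table["MS"] / table.loc["Error", "MS"]
    table.loc["Error", "F"] = float("nan")             # no F value for the error row
    return table


data = pd.DataFrame(
    {
        "Variety_1": [64, 55, 59, 58],
        "Variety_2": [72, 57, 66, 57],
        "Variety_3": [74, 47, 58, 53],
    },
    index=["Type_1", "Type_2", "Type_3", "Type_4"],
)
result = anova_table(data)
print(result)
```

On the example data this gives sums of squares of 498 for the rows, 56 for the columns and 108 for the error, which lead to the same F values as compute_anova_parameter.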
4. Testing
Our data is stored in an Excel file, with the row labels in the first column.
We can load it and use it:
import pandas as pd
data = pd.read_excel('fertlizer.xlsx', index_col=0)
print(compute_anova_parameter(data))
Output: {'Fc': 1.556, 'Fr': 9.222}
You can add as many rows or columns to the Excel file as you want, and the program will handle them. The resulting values are then used in step 5 to draw the conclusion of the hypothesis test.
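Steps 4 and 5 are outside the scope of this post, but for orientation here is a rough sketch of how the two F values could be compared with critical values. It relies on SciPy, which the post itself does not use, and on a significance level of 0.05, which is an assumption chosen for the example.

```python
from scipy import stats

alpha = 0.05                      # assumed significance level
f_c, f_r = 1.556, 9.222           # values returned by compute_anova_parameter

# Degrees of freedom for 4 rows and 3 columns.
df_cols, df_rows = 3 - 1, 4 - 1
df_error = df_cols * df_rows

# Critical values of the F distribution (upper tail).
crit_c = stats.f.ppf(1 - alpha, df_cols, df_error)
crit_r = stats.f.ppf(1 - alpha, df_rows, df_error)

reject_cols = f_c > crit_c        # is there a variety (column) effect?
reject_rows = f_r > crit_r        # is there a fertilizer (row) effect?

print(round(crit_c, 3), reject_cols)
print(round(crit_r, 3), reject_rows)
```

Under these assumptions, only the row statistic exceeds its critical value, which would point to a fertilizer effect but no detectable variety effect at the 5% level.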