
Statistical concepts for Data Science


Understanding fundamental statistical concepts is essential for doing data science. In this article, we will walk through some important notions used by data scientists:

  1. Population and sample

  2. Binomial distribution

  3. Poisson distribution

  4. Normal distribution

  5. Measures of central tendency

  6. Measures of dispersion

  7. Covariance and correlation

  8. Percentiles, outliers, and box plots

  9. Empirical Cumulative Distribution Function

  10. Bootstrap

Libraries used in this article

import numpy as np
import pandas as pd
from scipy.stats import norm, binom
import matplotlib.pyplot as plt
import statistics
import seaborn as sns
from scipy.special import factorial


Population and sample


When we want to study a large set of data, we can extract a specific group and analyse it instead. This is easier and still gives us a good approximate picture of the whole. The sample is therefore the specific group of elements used to draw conclusions about the population, and its size is necessarily smaller than the total size of the population. For instance, to study the variation in height of the people of an entire country, we would select a sample rather than measure everyone.
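
A quick way to see this in code: below is a minimal sketch (using a synthetic population of one million hypothetical height values, not the dataset analysed later in this article) showing that the mean of a modest random sample already lands close to the population mean.

np.random.seed(0)  # fix the seed so the example is reproducible
population = np.random.normal(loc=170, scale=10, size=1_000_000)  # hypothetical heights in cm
sample = np.random.choice(population, size=1000, replace=False)  # the sample is much smaller than the population
print("Population mean: {:.2f}, sample mean: {:.2f}".format(population.mean(), sample.mean()))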


Binomial distribution


The binomial distribution is the probability distribution of the number of successes in a sequence of independent trials. For example, suppose we want to know how many heads we get when we flip 3 fair coins, and we repeat this experiment 10 times. The code below draws the total number of heads from each set of flips of a perfect coin.

binom.rvs(3, .5, size=10)  # 10 draws of the number of successes in 3 trials with p = 0.5

output: array([2, 3, 0, 1, 2, 2, 2, 2, 3, 2]) (the draws are random, so your numbers will differ)
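
Besides drawing random samples, scipy also exposes the binomial probability mass function directly. As a small illustration (not part of the original example), here is the probability of getting exactly 2 heads in 3 flips of a fair coin:

binom.pmf(2, 3, 0.5)  # C(3,2) * 0.5**3 = 3/8 = 0.375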


Poisson distribution


A Poisson process is a process in which events appear to happen at a certain average rate, but completely at random. For example, the number of cases of a disease in different towns follows a Poisson process. The Poisson distribution is described by a parameter called lambda, which represents the average number of events per time period, and its shape varies with lambda.

# The Poisson PMF is defined on integer counts k = 0, 1, 2, ...
k = np.arange(0, 20)
d1 = np.exp(-1) * np.power(1, k) / factorial(k)
d5 = np.exp(-5) * np.power(5, k) / factorial(k)
d8 = np.exp(-8) * np.power(8, k) / factorial(k)

plt.plot(k, d1, marker='o')
plt.plot(k, d5, marker='o')
plt.plot(k, d8, marker='o')

plt.xlabel('Number of events')
plt.ylabel('Probability')
plt.legend(['lambda = 1', 'lambda = 5', 'lambda = 8'])

plt.show()
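
If you prefer not to write the formula by hand, scipy.stats also ships a ready-made Poisson distribution; the following sketch reproduces the lambda = 5 curve with poisson.pmf:

from scipy.stats import poisson

k = np.arange(0, 20)
print(poisson.pmf(3, mu=5))  # P(X = 3) when lambda = 5, roughly 0.1404
plt.plot(k, poisson.pmf(k, mu=5), marker='o')
plt.show()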

Normal distribution


The normal distribution is also known as the Gaussian distribution. It is one of the most important probability distributions and is used to model many real-world situations. Its probability density function, plotted below, appears as a symmetric bell curve. The normal distribution is fully described by its mean and standard deviation. Here is a standard normal distribution, with a mean of 0 and a standard deviation of 1.

sns.set_style("darkgrid")
fig, ax = plt.subplots(1, 1)
mean, var, skew, kurt = norm.stats(moments='mvsk')  # moments of the standard normal
x = np.linspace(-4, 4, 100)  # about four standard deviations on each side
ax.plot(x, norm.pdf(x),
       'r-', lw=5, alpha=0.6, label='norm pdf')
plt.show()
print(var)  # the variance of the standard normal is 1.0
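
Beyond plotting the density, norm.cdf lets us turn the curve into probabilities. For example, we can verify the classic rule that roughly 68% of a standard normal's mass lies within one standard deviation of the mean:

print(norm.cdf(1) - norm.cdf(-1))  # approximately 0.6827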

Measures of central tendency


For this task, we will use a Kaggle dataset containing the heights (inches) and weights (pounds) of 25,000 different 18-year-old individuals.

df = pd.read_csv("SOCR-HeightWeight.csv")
heights = df["Height(Inches)"]
plt.hist(heights, color="red", lw=2)
plt.xlabel('Height(Inches)')
plt.ylabel('Frequency')
plt.show()

Typically, when we want to measure the center of a dataset, we use the mean, the median, or the mode. The mean, often called the average, is the most common way of summarizing data: we sum all the values and divide by the number of data points.

mean = np.mean(heights)
print("The heights mean is {} inches".format(mean))

output: The heights mean is 67.99311359679979 inches


Another measure of center is the median: the value for which 50% of the data is higher and 50% is lower. We can compute it by sorting all the values and taking the middle one (for an even number of points, the average of the two middle values).

heights_sorted = heights.sort_values()  # sort a copy so heights stays aligned with the weights column used later
heights_sorted.reset_index(drop=True, inplace=True)
middle_index = int(len(heights_sorted) / 2)
median = heights_sorted.iloc[middle_index]
print("The calculated median is: {}\nThe numpy median function result: {}".format(median, np.median(heights)))

output: The calculated median is: 67.99592
The numpy median function result: 67.9957


The mode is the most frequent value in the data.

print(heights.value_counts())
print("The most frequent height is {}".format(statistics.mode(heights)))

Measures of dispersion


In order to better understand our data, we also need measures of spread, which describe how spread apart or close together the data points are. We begin with the first measure of spread, the variance: the average of the squared distances from each data point to the mean.


dists = heights - np.mean(heights)
sq_dists = dists ** 2
sum_sq_dists = np.sum(sq_dists)
n = len(heights)
variance = sum_sq_dists / (n-1)
print("The calculated variance = {}\nThe numpy variance = {}" \
      .format(variance, np.var(heights, ddof=1)))
# Without ddof=1, population variance is calculated instead of sample variance

output: The calculated variance = 3.616382148854081 The numpy variance = 3.6163821488540733


The variance is expressed in squared units, so we often prefer an easier-to-interpret measure of dispersion: the standard deviation. It is simply the square root of the variance.

print("STD: {}".format(np.sqrt(variance)))

output: STD: 1.9016787712056105
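
As a quick sanity check, numpy's standard deviation with the same ddof=1 correction should print the same number:

print("STD: {}".format(np.std(heights, ddof=1)))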


Another measure of dispersion, similar to the standard deviation, is the mean absolute deviation. It takes the absolute values of the distances to the mean and then averages those differences.

dists = heights - np.mean(heights)
print(np.mean(np.abs(dists)))

output: 1.5179448356838294


Covariance and correlation


Let us now visualize the relationship between height and weight in our dataset with a scatter plot.

weights = df["Weight(Pounds)"]
_ = plt.plot(heights, weights, marker='.', linestyle='none', color="red", alpha=1)
plt.xlabel('Height(Inches)')
plt.ylabel('Weight(Pounds)')
plt.show()

Now we need a number that summarizes how height varies with weight. One such statistic is the covariance. If the covariance is positive, the two variables are positively correlated; conversely, if it is negative, they are negatively correlated. As with the variance, the covariance carries awkward units (here, inch-pounds), so we divide it by the product of the standard deviations of height and weight to obtain a unitless measure: the Pearson correlation coefficient.


covariance_matrix = np.cov(heights, weights)
print(covariance_matrix)

output: [[  3.61638215  -0.17537673]
         [ -0.17537673 135.97653199]]


print("Covariance: {}".format(covariance_matrix[0,1]))

output: Covariance: -0.17537672618776684


corr_mat = np.corrcoef(heights, weights)
print("Correlation: {}".format(corr_mat[0,1]))
print(heights.corr(weights)) # 2nd method

output: Correlation: -0.007908658448136563
-0.007908658448136563


Here the Pearson correlation coefficient is essentially zero, so we find no evidence of a linear relationship between the two variables in this dataset.
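
To connect the two computations, we can rebuild the coefficient by hand from the covariance. Note that np.cov uses ddof=1 by default, so we match it with sample standard deviations:

r_manual = covariance_matrix[0, 1] / (np.std(heights, ddof=1) * np.std(weights, ddof=1))
print(r_manual)  # matches np.corrcoef's off-diagonal entry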


Percentiles, outliers, and box plots


Percentiles are useful summary statistics. The k-th percentile is the value below which k percent of the observations in the data fall.

np.percentile(heights, [25, 50, 75])

output: array([66.7043975, 67.9957 , 69.2729575])


Let's move on to graphical exploration and introduce box plots. They show a lot of information about the data: the line inside the box is the median (the 50th percentile), and the edges of the box are the 25th and 75th percentiles. Points outside the whiskers are plotted individually and are known as outliers. A common rule flags points more than 1.5 IQRs beyond the edges of the box (the IQR being the length of the box) as outliers.

c = 'red'
sns.boxplot(x=heights, flierprops=dict(color=c, markeredgecolor=c),medianprops=dict(color=c))
plt.show()
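
As a sketch of the rule mentioned above, we can flag the outliers by hand with the 1.5 x IQR criterion:

q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1  # the length of the box
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]
print("{} outliers outside [{:.2f}, {:.2f}]".format(len(outliers), lower, upper))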


Empirical Cumulative Distribution Function


In statistics, the empirical cumulative distribution function (ECDF) is the distribution function associated with the empirical measure of a sample. In the graph of an ECDF, the y value at a point is the fraction of data points with a value less than or equal to the corresponding x value.

x = np.sort(weights)
y = np.arange(1, len(x)+1) / len(x)
plt.plot(x, y, marker='.', linestyle='none', color='red')
plt.xlabel('Weight')
plt.ylabel('ECDF')
plt.margins(0.02) # Keeps data off plot edges
plt.show()
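
Reading a value off the ECDF is just a proportion. For instance, the fraction of people weighing at most 130 pounds (an arbitrary threshold, chosen only for illustration) is:

print(np.mean(weights <= 130))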

Bootstrap


Bootstrapping refers to resampling a dataset at random with replacement; each resampled dataset is called a bootstrap replicate. Bootstrapping estimates the properties of an estimator (such as its mean) by measuring those properties when sampling from an approximating distribution. One standard choice for the approximating distribution is the empirical distribution function of the observed data.

for i in range(50):
    # Draw a bootstrap sample: resample with replacement (np.random.choice's default)
    bs_sample = np.random.choice(heights, size=len(heights))
    
    # Compute and plot ECDF from bootstrap sample
    x = np.sort(bs_sample)
    y = np.arange(1, len(x)+1) / len(x)
    
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='grey', alpha=0.1)

# Compute and plot ECDF from original data
x = np.sort(heights)
y = np.arange(1, len(x)+1) / len(x)
_ = plt.plot(x, y, marker='.')


plt.margins(0.02)
_ = plt.xlabel('Height(Inches)')
_ = plt.ylabel('ECDF')

# Show the plot
plt.show()
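
Beyond visualizing variability, bootstrap replicates can be turned into a confidence interval. Here is a minimal sketch of a 95% interval for the mean height; the choice of 1,000 replicates is arbitrary but common:

bs_means = np.array([np.mean(np.random.choice(heights, size=len(heights)))
                     for _ in range(1000)])  # one mean per bootstrap replicate
ci = np.percentile(bs_means, [2.5, 97.5])
print("95% CI for the mean height: {}".format(ci))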

Thanks for your attention!

