
Data Scientist Program

 


Some Statistical Concepts for Data Science




Statistics is one of the key parts of data science. In this blog post, we'll present some important statistical concepts for data science: mean, median, mode, variance, standard deviation, and covariance.


Measures of Central Tendency

Measures of central tendency are single numbers that summarize a data set by identifying its central position. Also known as measures of central location, they are the mean, the median, and the mode, and each of them suits a specific use case.


a. Mean

The mean is a fundamental concept in mathematics and statistics. It's the most popular and best-known measure of central tendency, and it can be used for both discrete and continuous data. The mean is the sum of all the values in the data set divided by the number of values in the data set. There are two types of arithmetic mean: the population mean and the sample mean. They are calculated in the same way, but written differently:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad\qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where N is the size of the population and n is the size of the sample.

What’s the difference between population and sample?

In short, a population is the entire group that you want to draw conclusions about, while a sample is the specific subset of that group that you actually collect data from.
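As a rough illustration, the sketch below treats the eleven values used throughout this post as if they were a whole population and draws a small random sample from it. The sample size of 4 and the seed are arbitrary assumptions made for the example.

```python
import numpy as np

# Assumption for illustration: these eleven values are the entire population.
population = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])

# Draw a sample of 4 values without replacement (size and seed are arbitrary).
rng = np.random.default_rng(seed=0)
sample = rng.choice(population, size=4, replace=False)

population_mean = population.mean()  # mean over every member of the population
sample_mean = sample.mean()          # mean over the sampled members only

print(population_mean)
print(sample_mean)
```

The sample mean is only an estimate of the population mean, so the two printed values will generally differ.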


In Python, we don't need to apply those formulas by hand. NumPy and pandas both do the calculation under the hood, so computing the mean stays very simple whichever library you use.

### USING NUMPY
import numpy as np

data = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87,45,92])
np.mean(data)

Output: 59.0

### USING PANDAS
import pandas as pd
pd.DataFrame(data).mean()

Output:
0    59.0
dtype: float64
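To see what the libraries do under the hood, here is a minimal sketch that computes the mean directly from its definition (sum of the values divided by their count) and compares it with np.mean. The variable name manual_mean is just an illustrative choice.

```python
import numpy as np

data = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])

# Mean from the definition: sum of all values divided by the number of values.
manual_mean = data.sum() / data.size

print(manual_mean)    # 59.0
print(np.mean(data))  # 59.0
```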


b. Median

The median is the middle value of a list of data once the values are arranged in order. More formally, if the sample size n is odd, the median is the value at position (n+1)/2 in the sorted data. If n is even, the median is the sum of the values at positions n/2 and (n/2)+1, divided by 2.

Once again, Python simplifies the work for us. Using NumPy or pandas, we can calculate the median like this:

## Numpy
np.median(data)

Output: 56.0

## Pandas
pd.DataFrame(data).median()

Output:
0    56.0
dtype: float64
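The same result can be reproduced by hand: sort the values, then take the middle one (odd size) or the average of the two middle ones (even size). The helper name median_by_hand below is hypothetical, and this is only a sketch of the idea, not how NumPy implements it.

```python
def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd size: the single middle value (0-based index n // 2).
        return ordered[middle]
    # Even size: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(median_by_hand(data))          # 56
print(median_by_hand([1, 2, 3, 4]))  # 2.5
```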

c. Mode

The mode is the value that appears most frequently in the data set. Unlike the mean and the median, there can be several modes in the same data set, and the mode can be found for any type of data.

Because the mode can be found for any type of data while NumPy is designed to work with numbers, NumPy doesn't provide a direct method to find the mode. To find it, we'll do a little gymnastics.

values, counts = np.unique(data, return_counts=True)
index = np.argmax(counts)
values[index]

Output: 55

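Here, we use NumPy to count the number of occurrences of each value in the data set, then we use np.argmax to find the index of the maximum value in the array of counts, and finally we return the value at that index. Note that np.argmax returns only the first maximum, so this approach reports a single mode (55) even though 55 and 56 both appear twice.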

This operation is simpler with pandas, which returns both modes in our data (55 and 56).

pd.DataFrame(data).mode()
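If you prefer to stay in NumPy and still get every mode, one possible approach is to keep all values whose count equals the maximum count instead of using np.argmax. This is a small sketch of that idea; the variable name modes is an illustrative choice.

```python
import numpy as np

data = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])

# Count how often each distinct value occurs.
values, counts = np.unique(data, return_counts=True)

# Keep every value whose count ties for the maximum, not just the first one.
modes = values[counts == counts.max()]

print(modes)  # [55 56]
```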

Variance

The variance measures how far the values in the data set lie from the mean. In other words, it's a statistical measurement of the spread between the values in the data set. It is calculated by taking the difference between each number in the data set and the mean, squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set. The formula is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$

We can easily calculate the variance using NumPy and pandas. The two results below differ because np.var divides by n by default (population variance), while the pandas var method divides by n - 1 by default (sample variance):

## Numpy
np.var(data)
Output: 514.1818181818181
## Pandas
pd.DataFrame(data).var()
Output: 
0    565.6
dtype: float64
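The gap between 514.18 and 565.6 comes from the divisor: NumPy defaults to ddof=0 (divide by n), while pandas defaults to ddof=1 (divide by n - 1). A short sketch showing how the ddof parameter lets either library produce either value:

```python
import numpy as np
import pandas as pd

data = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])

# NumPy: population variance by default, sample variance with ddof=1.
numpy_population_var = np.var(data)        # about 514.18
numpy_sample_var = np.var(data, ddof=1)    # about 565.6

# pandas: sample variance by default, population variance with ddof=0.
pandas_sample_var = pd.Series(data).var()
pandas_population_var = pd.Series(data).var(ddof=0)

print(numpy_population_var, pandas_population_var)
print(numpy_sample_var, pandas_sample_var)
```

Use ddof=1 when the data are a sample from a larger population, and ddof=0 when they are the whole population.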


Standard deviation

The standard deviation measures how spread out the data are. It's calculated as the square root of the variance. A low standard deviation indicates that the data are grouped closely around the mean, while a high standard deviation shows that the data are more spread out.


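In Python, we calculate it as follows. As with the variance, np.std uses the population formula by default while pandas uses the sample formula, which explains the two different results: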

## Numpy
np.std(data)
Output: 22.675577571074527
## Pandas
pd.DataFrame(data).std()
Output:
0    23.782346
dtype: float64
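As a quick check of the square-root relationship, the sketch below computes the standard deviation from the variance and compares it with np.std, for both the population (ddof=0) and sample (ddof=1) conventions.

```python
import numpy as np

data = np.array([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])

# Population standard deviation: square root of the population variance.
population_std = np.sqrt(np.var(data))

# Sample standard deviation: square root of the sample variance (ddof=1).
sample_std = np.sqrt(np.var(data, ddof=1))

print(population_std, np.std(data))      # both about 22.68
print(sample_std, np.std(data, ddof=1))  # both about 23.78
```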

Covariance

Covariance measures the relationship between two random variables and helps us know whether the two variables vary together. The covariance can be positive or negative: a positive covariance indicates that the two variables tend to move in the same direction, while a negative covariance reveals that they tend to move in opposite directions. We calculate the covariance of a population using this formula:

$$\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$

To calculate the covariance of a sample, we use:

$$\mathrm{Cov}(X, Y) = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

where:

  • X_i: the values of the X-variable

  • Y_i: the values of the Y-variable

  • X̄: the mean (average) of the X-variable

  • Ȳ: the mean (average) of the Y-variable

  • n: the number of data points in the sample (N for the population)

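To calculate covariance with NumPy, we use np.cov. Called on a single variable, as here, it returns the covariance of that variable with itself, which is its sample variance (the same 565.6 that pandas reported earlier):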


np.cov(data)
Output:
array(565.6)
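With two variables, np.cov returns a 2x2 covariance matrix whose off-diagonal entries hold the covariance between them. The sketch below uses small made-up arrays (x, y_up, and y_down are hypothetical data, not from this post) to show a positive and a negative covariance, and checks the result against the sample formula.

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1, 2, 3, 4, 5])
y_up = np.array([2, 4, 6, 8, 10])    # rises when x rises
y_down = np.array([10, 8, 6, 4, 2])  # falls when x rises

# np.cov(x, y) returns [[var(x), cov(x, y)], [cov(x, y), var(y)]],
# using the sample formula (divide by n - 1) by default.
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]

# Same sample covariance written out from the formula.
manual_cov_up = ((x - x.mean()) * (y_up - y_up.mean())).sum() / (x.size - 1)

# Population covariance: pass ddof=0 to divide by n instead.
population_cov_up = np.cov(x, y_up, ddof=0)[0, 1]

print(cov_up)             # 5.0
print(cov_down)           # -5.0
print(manual_cov_up)      # 5.0
print(population_cov_up)  # 4.0
```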


Conclusion

In this post, we presented some important statistical concepts used in data science. You can find the source code here.


