# Statistical Concepts for Data Science

Statistics is a branch of mathematics that uses models and summaries of data to make that data easier to understand. It covers data collection, data organisation, data analysis, data interpretation, and data visualisation.

Let us look at 10 important statistical concepts used in data science, with examples.

First, we import the libraries that we will use throughout.

    import statistics
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm
    from scipy import stats
    import seaborn as sns

**1. Measures of Central Tendency**

A measure of central tendency is a single value that describes the centre of a data set. The three most common measures are the mean, the median, and the mode.

Mean: The mean is the sum of all the values in the data set divided by the number of values. We can calculate it with the statistics.mean() function.

    statistics.mean([1, 3, 3, 5, 7, 8])

Output:

    4.5

Median: The median is the middle value of a data set that has been arranged in order of magnitude. With an even number of values, it is the average of the two middle values. We calculate it with the statistics.median() function.

    statistics.median([1, 3, 5, 7, 9])

Output:

    5

Mode: The mode is the most frequently occurring value in the data set. We calculate it with the statistics.mode() function.

    statistics.mode([1, 3, 3, 5, 7, 8])

Output:

    3
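As a small sketch of how these three measures behave, the snippet below recomputes the mean by hand and then adds one extreme value to the same list. The value 100 is an arbitrary illustration, not part of the original data; it is there only to show that the mean reacts to outliers far more than the median does.

```python
import statistics

data = [1, 3, 3, 5, 7, 8]

# Mean by hand: sum of the values divided by how many there are.
manual_mean = sum(data) / len(data)

# With an even number of values, the median is the average of the two middle ones.
median_even = statistics.median(data)

# Add one extreme value (100 is an arbitrary, illustrative outlier).
with_outlier = data + [100]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("mean:", manual_mean, "median:", median_even)
print("with outlier -> mean:", mean_outlier, "median:", median_outlier)
```

The mean jumps from 4.5 to roughly 18, while the median only moves from 4.0 to 5.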

**2. Population and Sample**

In statistics, a population is the entire set of items under study. It can be a group of individuals, objects, events, organisations, etc. An example of a population is all the students in a school. Analysing a whole population is often not feasible, because collecting data on every member is too costly or simply impossible. Therefore, we use samples.

A sample is a subset of data selected from the population. Samples should be selected randomly so that they are representative of the population as a whole.
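The idea can be sketched with simulated data. The population below (heights of 10,000 students, normally distributed around 160 cm) is entirely hypothetical, and the seed and sample size are arbitrary choices made only so the example is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Hypothetical population: heights in cm of 10,000 students (made-up data).
population = rng.normal(loc=160, scale=10, size=10_000)

# Simple random sample of 200 students, drawn without replacement.
sample = rng.choice(population, size=200, replace=False)

print("population mean:", population.mean())
print("sample mean:    ", sample.mean())
```

A randomly drawn sample of this size should give a mean close to the population mean, which is what makes sampling useful when the full population cannot be measured.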

**3. Variance and Standard Deviation**

Standard Deviation is a measure of spread in Statistics. A low standard deviation means that most of the numbers are close to the mean value. A high standard deviation means that the values are spread out over a wider range.

    speed = [32, 111, 138, 28, 59, 77, 97]
    x = np.std(speed)
    print(x)

Output:

    37.84501153334721

Variance is another number that indicates how spread out the values are. Variance is the square of Standard deviation.

    speed = [32, 111, 138, 28, 59, 77, 97]
    x = np.var(speed)
    print(x)

Output:

    1432.2448979591834
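One detail worth checking: by default, np.std() and np.var() use the population formula (dividing by n). When the data are a sample rather than the full population, passing ddof=1 divides by n - 1 instead. The sketch below reuses the speed list to show both versions and to confirm that the variance is the square of the standard deviation.

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]

pop_std = np.std(speed)              # divides by n (ddof=0 is the default)
sample_std = np.std(speed, ddof=1)   # divides by n - 1

pop_var = np.var(speed)

print("population std:", pop_std)
print("sample std:    ", sample_std)
print("std squared equals variance:", np.isclose(pop_std ** 2, pop_var))
```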

**4. Normal Distribution**

A normal distribution can be thought of as a bell curve, with most of its values clustered around its central peak, the mean of the distribution. Values near the mean are more likely than values far from it, and the curve is symmetric about the mean.

*We will create a normal distribution with mean 5 and standard deviation 10.*

    m = 5
    std = 10
    snd = norm(m, std)

*Generate 1000 evenly spaced values between -100 and 100*

    x = np.linspace(-100, 100, 1000)

*Plot the normal distribution curve for values of the random variable in the range -100 to 100*

    plt.figure(figsize=(7.5, 7.5))
    plt.plot(x, snd.pdf(x))
    plt.xlim(-60, 60)
    plt.title('Normal Distribution Curve (Mean = 5, STD = 10)', fontsize=10)
    plt.xlabel('Values of Random Variable X', fontsize=15)
    plt.ylabel('Probability Density', fontsize=15)
    plt.show()
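To put numbers on "values near the mean are more likely", the following sketch uses the cumulative distribution function of the same distribution (mean 5, standard deviation 10). The two intervals are illustrative choices: both are 20 units wide, one centred on the mean and one far out in the right tail.

```python
from scipy.stats import norm

snd = norm(5, 10)  # mean 5, standard deviation 10, as in the example above

# Probability of landing within one standard deviation of the mean (-5 to 15).
near = snd.cdf(15) - snd.cdf(-5)

# Probability of landing in an equally wide interval far from the mean (25 to 45).
far = snd.cdf(45) - snd.cdf(25)

print("P(-5 < X < 15):", round(near, 4))   # about 0.6827
print("P(25 < X < 45):", round(far, 4))

# Symmetry: points at the same distance on either side of the mean have equal density.
print(snd.pdf(0), snd.pdf(10))
```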

**5. Skewness**

Skewness measures the asymmetry of a distribution, that is, how far its shape departs from a symmetric curve such as the normal distribution.

If one tail of the curve is longer than the other, the distribution is said to be skewed.

Positive skew: the distribution has a longer, thicker right tail and, typically, mode < median < mean.

Negative skew: the distribution has a longer, thicker left tail and, typically, mode > median > mean.

Zero skew: the distribution is symmetric about its mean and approximately mode = median = mean.
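Skewness can be measured with scipy.stats.skew(). The sketch below compares two simulated data sets; the exponential and normal samples, their parameters, and the seed are illustrative assumptions rather than data from this article.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

# Simulated data: an exponential sample has a long right tail,
# a normal sample is symmetric.
right_skewed = rng.exponential(scale=2.0, size=10_000)
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

print("skew of exponential sample:", skew(right_skewed))  # clearly positive
print("skew of normal sample:     ", skew(symmetric))     # close to zero

# In the right-skewed sample the mean sits above the median.
print("mean > median:", right_skewed.mean() > np.median(right_skewed))
```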

**6. Central Limit Theorem**

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is.

    # let's roll a die 10 times
    die = pd.Series([1, 2, 3, 4, 5, 6])
    roll_die = die.sample(10, replace=True)
    print(roll_die)

Output (your values will differ because the rolls are random):

    0    1
    1    2
    4    5
    5    6
    1    2
    3    4
    5    6
    2    3
    0    1
    0    1
    dtype: int64

    # Roll the die 10 times, and repeat that 10 times
    sample_means = []
    for i in range(10):
        roll_die = die.sample(10, replace=True)
        sample_means.append(np.mean(roll_die))
    print(sample_means)
    # Convert to Series and plot histogram
    sample_means_series = pd.Series(sample_means)
    sample_means_series.hist()
    # Show plot
    plt.show()

Output:

    [3.7, 3.1, 4.3, 4.5, 3.5, 2.9, 3.3, 4.4, 4.1, 4.6]

    # 100 sample means
    sample_means = []
    for i in range(100):
        sample_means.append(np.mean(die.sample(10, replace=True)))
    # Convert to Series and plot histogram
    sample_means_series = pd.Series(sample_means)
    sample_means_series.hist()
    # Show plot
    plt.show()

    # 1000 sample means
    sample_means = []
    for i in range(1000):
        sample_means.append(np.mean(die.sample(10, replace=True)))
    # Convert to Series and plot histogram
    sample_means_series = pd.Series(sample_means)
    sample_means_series.hist()
    # Show plot
    plt.show()

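From the above we can see that as we collect more sample means, their histogram looks more and more like a normal distribution curve. Note that the size of each sample stays at 10 throughout; it is the number of samples that grows from 10 to 1000.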
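The effect of the sample size itself can be sketched separately. The helper below is a hypothetical function written for this illustration, and the seed, the sample sizes (2, 10, 50) and the 10,000 repetitions are arbitrary choices. It uses NumPy's random generator instead of pandas so that all rolls are drawn in one call.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

def sample_means(sample_size, n_samples=10_000):
    """Roll a fair die `sample_size` times, repeat `n_samples` times, return each mean."""
    # integers() excludes the upper bound, so this draws values 1 to 6.
    rolls = rng.integers(1, 7, size=(n_samples, sample_size))
    return rolls.mean(axis=1)

for n in (2, 10, 50):
    means = sample_means(n)
    # The centre stays near 3.5 while the spread shrinks as n grows.
    print(n, round(means.mean(), 3), round(means.std(), 3))
```

The sample means stay centred on 3.5, the expected value of a fair die, and their spread shrinks roughly in proportion to one over the square root of the sample size.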

**7. Percentiles**

A percentile is the value below which a given percentage of the values in a data set fall.

    # find the age below which 75% of the values fall
    ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]
    x = np.percentile(ages, 75)
    print(x)

Output:

    43.0

The above example illustrates that 75% of the people are 43 or younger.
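This interpretation can be checked directly. The sketch below counts the share of ages at or below the 75th percentile and confirms that the 50th percentile is the median. With only 21 values the share is close to, but not exactly, 75%.

```python
import numpy as np

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

p75 = np.percentile(ages, 75)

# Fraction of values at or below the 75th percentile.
share = np.mean(np.array(ages) <= p75)

# The 50th percentile is the median.
p50 = np.percentile(ages, 50)

print("75th percentile:", p75)
print("share at or below it:", share)
print("50th percentile equals median:", p50 == np.median(ages))
```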

    # create an array of 100 random integers from 0 to 499
    data = np.random.randint(0, 500, 100)
    # find the 37th percentile of the array
    np.percentile(data, 37)

Output (your values will differ because the data are random):

    155.63

    # find the quartiles (25th, 50th, and 75th percentiles) of the array
    np.percentile(data, [25, 50, 75])

Output:

    array([ 92.25, 223.5 , 398.25])

**8. Linear Regression**

Linear regression fits a straight line through a set of data points so that the line is as close as possible to all of them; it usually does not pass through every point. With the help of this line we can also predict new values.

We have two sets of data, x and y, as follows:

    x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
    y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

Next we compute the key values of the linear regression: the slope, the intercept, the correlation coefficient r, the p-value, and the standard error of the slope.

    slope, intercept, r, p, std_err = stats.linregress(x, y)

*Next we create a function that uses the slope and intercept values to return a new value. The new value represents where on the y-axis the corresponding x value will be placed.*

    def reg(x):
        return slope * x + intercept

*Run each value of the x array through the function*

    plot = list(map(reg, x))

    # create scatter plot
    plt.scatter(x, y)
    # draw the line of linear regression
    plt.plot(x, plot)
    plt.show()
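The fitted line can also be used to predict y for an x value that is not in the data. The sketch below repeats the fit so that it runs on its own; the helper name predict and the input x = 10 are illustrative choices. It also prints r squared, which indicates how much of the variation in y the line explains.

```python
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

result = stats.linregress(x, y)

def predict(new_x):
    """Return the y value on the fitted line for a given x (hypothetical helper)."""
    return result.slope * new_x + result.intercept

print("slope:", result.slope)
print("intercept:", result.intercept)
print("r:", result.rvalue)
print("r squared:", result.rvalue ** 2)
print("predicted y at x = 10:", predict(10))
```

The slope is negative, so the line predicts lower y values as x increases.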

**9. Binomial Distribution**

The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes, such as success or failure, or heads or tails when tossing a fair coin.

np.random.binomial() takes three parameters: n is the number of trials, p is the probability of success on each trial (e.g. 0.5 for heads on a fair coin) and size is the shape of the returned array.

    x = np.random.binomial(n=10, p=0.5, size=20)
    print(x)

Output (your values will differ because the draws are random):

    [6 2 4 5 4 5 3 1 5 3 4 4 6 6 3 8 4 2 8 4]

*For visualising the binomial distribution*

    sns.histplot(np.random.binomial(n=10, p=0.5, size=1000), discrete=True, kde=False)
    plt.show()

*binom.pmf() calculates the probability of exactly k successes in n trials, for example exactly k heads out of n coin flips.*

*Let's look at an example.*

*Dorji makes 70% of his free-throw attempts. If he shoots 15 free throws, what is the probability that he makes exactly 10?*

    stats.binom.pmf(k=10, n=15, p=0.7)

Output:

    0.20613038097752118

*binom.cdf() calculates the probability of k or fewer successes in n trials, for example k heads or fewer out of n coin flips.*

*Sonam flips a fair coin 10 times. What is the probability she gets heads 5 times or fewer?*

    stats.binom.cdf(k=5, n=10, p=0.5)

Output:

    0.6230468749999999

*binom.sf() calculates the probability of more than k successes in n trials, for example more than k heads out of n coin flips.*

*Sonam flips a fair coin 10 times. What is the probability she gets heads more than 5 times?*

    stats.binom.sf(k=5, n=10, p=0.5)

Output:

    0.3769531250000001
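These results can be cross-checked against the binomial formula, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The function binom_pmf below is a hypothetical helper written for this check, not part of SciPy. The sketch also confirms that cdf and sf for the same k add up to 1, since "5 or fewer" and "more than 5" cover every outcome.

```python
from math import comb
from scipy import stats

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)  (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Dorji's free throws: exactly 10 makes out of 15 attempts at 70%.
manual = binom_pmf(10, 15, 0.7)
library = stats.binom.pmf(k=10, n=15, p=0.7)
print(manual, library)

# Sonam's coin flips: "5 or fewer" plus "more than 5" covers everything.
total = stats.binom.cdf(k=5, n=10, p=0.5) + stats.binom.sf(k=5, n=10, p=0.5)
print(total)
```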

**10. Continuous vs Discrete Data**

Discrete data are counts and take only distinct, separate values, usually integers. These values cannot be meaningfully subdivided into parts. Example: number of students in a class, number of test questions answered correctly.

Continuous data are numerical measurements that can take any value within a range. Between any two points there are infinitely many possible values, unlike in the case of discrete data. Example: temperature measured over a day, speed of a car.