top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Data Analysis with Pandas functions


Introduction

In this blog, I want to explain the Pandas function that is helpful to find interesting information from data. I used the ICC T20 World Cup 2021 data shown https://www.kaggle.com/affanazhar/icct20-worldcup2021 .


1. Loading and Assessing Basic Information From Data:


Start with importing the Pandas library:

import pandas as pd

To load the data from the Comma Separated Value (csv file), we use pd.read_csv() function.

data = pd.read_csv(r'C:\Users\Mairaj-PC\Desktop\worldt20_data.csv')
data.head()

To get some basic information from the data, we use dataframe.info() method

# check the description of data in detail
data.info()
 <class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               33 non-null     object 
 1   team_1                   33 non-null     object 
 2   team_2                   33 non-null     object 
 3   stage                    33 non-null     object 
 4   Winner_toss              33 non-null     object 
 5   Toss_descision           33 non-null     object 
 6   time                     33 non-null     object 
 7   venue                    33 non-null     object 
 8   avg_temperature          33 non-null     int64  
 9   best_bowler              33 non-null     object 
 10  bowling_arm              33 non-null     object 
 11  bowling_style            33 non-null     object 
 12  most_individual_wickets  33 non-null     int64  
 13  economy                  33 non-null     float64
 14  best_bowler_country      33 non-null     object 
 15  best_batter              33 non-null     object 
 16  batting_hand             33 non-null     object 
 17  high_indvidual_scores    33 non-null     int64  
 18  strike_rate              33 non-null     float64
 19  best_batter_team         33 non-null     object 
 20  target                   33 non-null     int64  
 21  target_achieved          33 non-null     int64  
 22  Player_of_the_match      33 non-null     object 
 23  Winner                   33 non-null     object 
dtypes: float64(2), int64(5), object(17)
memory usage: 6.3+ KB

The info function defines as the data have 32 rows and 24 columns. There are no null values in the data.


We also see a basic statistical description of data by using a dataframe.describe() function. This function shows us the statistical summary of all quantitative variables in the data.

# check the basic statistical summary of data
data.describe()

The described function shows us mean, standard deviation and many more. We get a many of information from this function like the mean temperature during the World Cup is 29.27 degrees Celcius and the highest individual score is 101.


2. Sorting, Counting and Applying a Condition to Data:


For sorting data either in ascending or descending order, we use dataframe.sort_values() function. By using this function, we find the top three highest targets during the World Cup.

# sort the data from highest to lowest target of each match
data.sort_values('target', ascending = False).head(3)

The output shows us the top three targets set by India, Afghanistan and Pakistan of 211, 191 and 190 respectively.


Sort values function also apply with multiple columns shown below:

# sorting with multiple columns (lowest by target and highest by temperature)
data.sort_values(['target', 'avg_temperature'], ascending = [True, False]).head(3)

The smallest target during the World Cup is 56 in Dubai with a hot temperature of 33 degrees Celcius.


We also examine our data by applying conditions by using and, or and not operators. An example of this operator is shown below:

# find the player those have 50 plus individual scores and target not achieved
data[(data['high_indvidual_scores'] > 50) & (data['target_achieved'] == 0)].head()

For explaining the information from categorical data to get counts of categorical variables. We use 'dataframe.value_counts()' method.

The T20 World Cup 2021 is in the United Arab Emirates. There are three venues Dubai, Sharjah and Abu Dhabi. We check that how many matches are played in each venue?

# find the count of matches in each venue
data['venue'].value_counts()
Dubai        13
Abu_Dhabi    11
Sharjah       9
Name: venue, dtype: int64

The output shows that Dubai Stadium covers more matches than the other two.


Take another example to see which country won the most matches in this World cup.

# find the most number of wins by each country
most_wins = data['Winner'].value_counts()
most_wins.head(4)
Australia      6
New_Zealand    5
Pakistan       5
SouthAfrica    4
Name: Winner, dtype: int64

Output is similar to expectation, Australia won the most matches therfore they are champion of this World Cup.


3. Group Summary Statistics From Data :


To make data split into separate groups to perform computations for better analysis. We use dataframe.groupby() function. We understand this concept with several examples.


First, we find the mean temperature in each venue of the World Cup. We use groupby() function as follows:

# average temperatue in each venue
data.groupby('venue')['avg_temperature'].mean()
venue
Abu_Dhabi    28.363636
Dubai        29.769231
Sharjah      29.666667
Name: avg_temperature, dtype: float64

The output shows us that Dubai has a slightly hotter temperature as compared to others.


We also apply groupby() function in multiple columns.

# find the summary statistics of strike rate and target for each winning country
data.groupby('Winner')[['strike_rate', 'target']].agg([min, max])

Let's take another example to find out the most economical bowling spell of the World Cup.

# find the most economical bowler 
data.groupby('best_bowler')['economy'].mean().sort_values().head(3)
best_bowler
Adil_Rashid      0.90
Hamid_Hassan     2.25
Anrich_Nortje    2.40
Name: economy, dtype: float64

Adil Rashid from England bowled the most economical spell of bowling with an average economy of 0.91 per over.


4. Indexing and Slicing:


We also slice and index data with any columns as per our requirement. We use two functions to apply the indexing. These are:

  • dataframe.loc[] : gets rows (and/or columns) with particular labels.

  • dataframe.iloc[]: gets rows (and/or columns) at integer locations.

Let's learn with examples, we want to find the tosses won by Pakistan team. We find by setting the index column to the 'Winner_toss" column by using dataframe.set_index() command.

# make the index set to winner toss to see which have more toss win
new_data = data.set_index('Winner_toss')
new_data.head(3)

Now find the tosses won by Pakistan team by using dataframe.loc[] method.

# subset the new data with few columns
selective_new_data = new_data[['team_2', 'stage']]

# print result of Pakistan
print(selective_new_data.loc['Pakistan'])
                  team_2        stage
Winner_toss                          
Pakistan           India  Group_stage
Pakistan     New_Zealand  Group_stage
Pakistan         Namibia  Group_stage
Pakistan        ScotLand  Group_stage

The output said that Pakistan won the 4 tosses, all in the group stage matches.


Let's look at the rows of matches 15 to 18 with dataframe.iloc[] method .

# we also use indexing by number using .iloc[]
# See the match 15 to 18
data.iloc[14:18,:5]

5. Plotting with Pandas:


Visualization is an important part of our analysis. Let's visualize the findings.

Make a bar plot to see the number of matches won by each team.

# Create a vraible most_wins that show number of matches won by every team
# lets see result with bar plot
most_wins.plot(kind = 'bar')

The output shows the winning matches details of 10 countries out of 12 participating countries because 2 countries Bangladesh and Scotland did not win a single match in the World Cup.


Let's find the relation between strike rate and targets chased. The chance is the highest target chased with a high strike rate so both have an increasing relationship. Let us check by using a scatter plot.

# lets find a relation with strike rate and target set in those game when target is chased
target_chased = data[data['target_achieved'] == 1]

# for this, we use scatter plot
target_chased.plot(x = 'target', y = 'strike_rate', kind = 'scatter')

The output does not show any strong increased relationship because small targets also chased with high strike rates.


Conclusion:


Pandas functions are very helpful to find insight from the data. They help us to get a better understanding of our data with a versatile approach. By the use of Pandas techniques in our analysis, we find the following information from our data:

  • India set the highest target of 211 against Afghanistan.

  • The smallest target of the World Cup is 56.

  • The individual highest score by a batsman is 101.

  • Dubai Stadium covers the most matches i.e 13 with hottest average temperature as compared to the other two venues.

  • Australia won the most matches i.e 6.

  • Adil Rashid from England bowled the most economical spell of the World Cup i.e 0.91 per over.

  • Bangladesh and Scotland did not win a single match in the World Cup.

I hope this blog is helpful for you to learn some Pandas functions. The GitHub repository of this blog is https://github.com/MuhammadMairajSaleem92/Python-Pandas-Techniques/blob/main/Pandas%20Techniques.ipynb



0 comments

Recent Posts

See All

Comentarios


bottom of page