Alberta Johnson

Nov 27, 2021 · 5 min

Useful Pandas Techniques for Dealing with Data Using Python

I bet you know that real-world data is messy, and we cannot make inferences from messy data. That is why Python has libraries such as Pandas to make working with it easier.

Before we dive right into it, let's talk about Pandas.

What is Pandas in Python?

Pandas is an essential Python library: a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation. It is one of the most popular data-wrangling and data-manipulation packages, and it works well with many other data science modules in the Python ecosystem.

Pandas makes it simple to perform the following data manipulations:

  • Extracting

  • Filtering

  • Transforming data into a DataFrame

  • Cleaning data so it is quick and reliable to work with

  • etc.

Pandas has a lot of functions for data manipulation in Python, and it is also a very powerful library for data wrangling.

Let's focus on five of these functions:

  1. pandas.apply()

  2. pandas to_datetime()

  3. pandas pivot_table()

  4. pandas groupby()

  5. pandas value_counts()

Let's get started;

In this blog, we will explore some functions that are very useful in the day-to-day data science journey. We will use the famous 911 emergency calls dataset from Kaggle for this work.

Note that all the examples in this blog are based on the 911 dataset.

First things first: let us import the library we need, pandas.

# import pandas library
 
import pandas as pd

Now we can load the file. Since the file is a CSV, we use the function read_csv().

# load the data

df = pd.read_csv('911.csv')
 
df.head()

Now, on to our main focus: let's talk about the pandas functions.

PANDAS .apply()

Pandas .apply() is a function that allows us to pass in a function and apply it to every single value of a pandas Series. It allows us to manipulate the columns and rows of a DataFrame.

Syntax:

df['s'] = df['s'].apply(func)

The .apply() function acts like Python's map() function: it takes a function as input and applies it to every element of a Series (or along an axis of a DataFrame).

Let's take an example;

In this case, we iterate over all the values in the series and take the letters before the colon. This will help us create a new column called Reason, i.e. the reason why people call 911.

df['title']

Output:

0 EMS: BACK PAINS/INJURY
 
1 EMS: DIABETIC EMERGENCY
 
2 Fire: GAS-ODOR/LEAK
 
3 EMS: CARDIAC EMERGENCY
 
4 EMS: DIZZINESS
 
...
 
99487 Traffic: VEHICLE ACCIDENT -
 
99488 Traffic: VEHICLE ACCIDENT -
 
99489 EMS: FALL VICTIM
 
99490 EMS: NAUSEA/VOMITING
 
99491 Traffic: VEHICLE ACCIDENT -
 
Name: title, Length: 99492, dtype: object

Look here:

df['Reason'] = df['title'].apply(lambda title: title.split(':')[0])

df['Reason']

Output:

0 EMS
 
1 EMS
 
2 Fire
 
3 EMS
 
4 EMS
 
...
 
99487 Traffic
 
99488 Traffic
 
99489 EMS
 
99490 EMS
 
99491 Traffic
 
Name: Reason, Length: 99492, dtype: object

The example above, with its code snippets, demonstrates how to use the .apply() function.
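As a side note, .apply() can also work row-wise on the whole DataFrame when you pass axis=1. Here is a minimal sketch; the Reason and twp columns come from the 911 dataset, but the combined Reason_Town column is purely hypothetical, for illustration only.

# Hypothetical illustration: apply a function to each row (axis=1)
# to combine the call reason and the township ('twp') into one label.
 
df['Reason_Town'] = df.apply(lambda row: f"{row['Reason']} - {row['twp']}", axis=1)
 
df['Reason_Town'].head()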

Moving on to the next,

PANDAS to_datetime()

When we import a CSV file into a DataFrame, the date-time values in the file are read as strings rather than as datetime objects. That makes it very difficult to perform operations such as computing a time difference. The pandas to_datetime() function allows us to convert a string argument into a Python datetime object.

This is going to help us extract useful information such as the time, day of the week, month, and year from the string.

Syntax:

df['s'] = pd.to_datetime(arg)
 

Parameters:

arg: An integer, string, float, list, or dict object to convert into a datetime object.

Let us look at an example to demonstrate how to use to_datetime().

From the code below, we want to know the hour, month, and day of the week on which people call 911.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])
 
df['timeStamp']

Output:

0 2015-12-10 17:40:00
 
1 2015-12-10 17:40:00
 
2 2015-12-10 17:40:00
 
3 2015-12-10 17:40:01
 
4 2015-12-10 17:40:01
 
...
 
99487 2016-08-24 11:06:00
 
99488 2016-08-24 11:07:02
 
99489 2016-08-24 11:12:00
 
99490 2016-08-24 11:17:01
 
99491 2016-08-24 11:17:02
 
Name: timeStamp, Length: 99492, dtype: datetime64[ns]

Let's continue coding:

df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)
 
df['Month'] = df['timeStamp'].apply(lambda time: time.month)
 
df['Day of Week'] = df['timeStamp'].apply(lambda time: time.dayofweek)

df['Day of Week']

Output:

0 3
 
1 3
 
2 3
 
3 3
 
4 3
 
..
 
99487 2
 
99488 2
 
99489 2
 
99490 2
 
99491 2
 
Name: Day of Week, Length: 99492, dtype: int64

Looking at the new Hour column:

df['Hour']

Output:

0 17
 
1 17
 
2 17
 
3 17
 
4 17
 
..
 
99487 11
 
99488 11
 
99489 11
 
99490 11
 
99491 11
 
Name: Hour, Length: 99492, dtype: int64
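As a quick aside, once timeStamp is a real datetime column, pandas also exposes the vectorized .dt accessor, which gives the same results as the per-value lambdas above. A minimal sketch of the same three columns:

# Same three columns as above, using the .dt accessor instead of .apply()
 
df['Hour'] = df['timeStamp'].dt.hour
 
df['Month'] = df['timeStamp'].dt.month
 
df['Day of Week'] = df['timeStamp'].dt.dayofweek  # Monday=0 ... Sunday=6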

Moving on to the next function,

PANDAS pivot_table()

The pivot_table() function is a pandas function used to create a spreadsheet-style pivot table as a DataFrame and to operate on tabular data, making the data more readable and easier to understand.

Syntax:

pd.pivot_table()

Let's work through an example of how to apply pivot_table().

What we want to do in our case is to create a table in which the days of the week are the index, the months are the columns, and each cell is populated with the count of calls for that day and month.

dayMonth = pd.pivot_table(df,index='Day of Week',values='Reason',columns='Month',aggfunc='count')

dayMonth

Output: (a table of call counts, with the days of the week as rows and the months as columns)
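Since the rendered table is not reproduced here, a tiny self-contained sketch with made-up data (not the 911 dataset) may help show the kind of table pivot_table() produces:

# Toy example: rows come from 'index', columns from 'columns',
# and each cell is the aggfunc ('count') of 'values' for that combination.
 
toy = pd.DataFrame({
    'Day': ['Mon', 'Mon', 'Tue', 'Tue', 'Tue'],
    'Month': [1, 2, 1, 1, 2],
    'Reason': ['EMS', 'Fire', 'EMS', 'Traffic', 'EMS'],
})
 
pd.pivot_table(toy, index='Day', columns='Month', values='Reason', aggfunc='count')
 
# Month  1  2
# Day
# Mon    1  1
# Tue    2  1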

Next,

PANDAS groupby()

The groupby function is a versatile function in Python. It allows us to split our data into groups so that we can perform analytical computations on each group for better understanding.

A groupby operation involves splitting the object into groups, applying a function to each group, and then combining the results. The pandas groupby function is used to group large amounts of data and then operate on those groups.

In short, pandas groupby groups the data according to categories and applies a function to each category, giving a better picture of the groups for further analysis.

Syntax:

DataFrame.groupby()

Let's find the day on which people call 911 the most. We can do that by simply grouping the data by the Day of Week field and then applying count() to it.

df.groupby(['Day of Week']).count()

Output: (a table with one row per day of the week and the count of calls in each column)

By combining it with another function called unstack(), we can generate the same spreadsheet-style table we created with pivot_table above.

df.groupby(['Day of Week','Month']).count()['Reason'].unstack()

Output: (the same day-by-month table of counts as produced with pivot_table above)
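To actually answer the question of which day people call the most, here is a short follow-up sketch built on the same groupby (the Day of Week and Reason columns were created earlier):

# Count calls per day of week and sort, busiest day first (0=Monday ... 6=Sunday)
 
calls_per_day = df.groupby('Day of Week')['Reason'].count()
 
calls_per_day.sort_values(ascending=False)
 
# Or get just the single busiest day:
 
calls_per_day.idxmax()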

Last but not least, the final function is the

PANDAS value_counts()
 

Another useful pandas function is value_counts(). Essentially, what value_counts() does is count how many times each unique value occurs. This technique is mostly used for data wrangling and data exploration in Python.

The value_counts() function in pandas returns the counts of unique values in descending order, so the most frequent element comes first.

Again, it excludes Not Available (NA) values by default.

Syntax:

df.value_counts()

Let's go ahead and find the top 20 towns that call 911 the most. By applying value_counts() together with the .head() function, we can get the top 20 towns with the most calls in descending order.

df['twp'].value_counts().head(20)

We can go further and get the proportion of calls for each town by simply setting the normalize parameter to True.

df['twp'].value_counts(normalize=True).head(20)

Output:

LOWER MERION 0.084898
 
ABINGTON 0.060101
 
NORRISTOWN 0.059226
 
UPPER MERION 0.052560
 
CHELTENHAM 0.046003
 
POTTSTOWN 0.041690
 
UPPER MORELAND 0.034530
 
LOWER PROVIDENCE 0.032429
 
PLYMOUTH 0.031755
 
HORSHAM 0.030196
 
MONTGOMERY 0.027129
 
UPPER DUBLIN 0.026526
 
WHITEMARSH 0.025400
 
UPPER PROVIDENCE 0.023258
 
LIMERICK 0.022846
 
SPRINGFIELD 0.022142
 
WHITPAIN 0.021468
 
EAST NORRITON 0.020493
 
LANSDALE 0.017527
 
HATFIELD TOWNSHIP 0.017416
 
Name: twp, dtype: float64
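value_counts() also has a few other handy parameters. As a minimal sketch (not run against the output above), we could include missing townships in the counts, or count the Reason column we created earlier:

# Include rows where the township is missing (NaN) in the counts
 
df['twp'].value_counts(dropna=False).head(20)
 
# Count the calls for each reason (EMS, Fire, Traffic)
 
df['Reason'].value_counts()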

Conclusion:

The above are some useful pandas techniques for dealing with data using Python.

Note:

A full look at the code snippets can be found in my GitHub repository.

Feel free to check them out here

HAPPY READING!!!!

Reference:

  1. https://www.geeksforgeeks.org/ - for more explanation of the pandas functions and others

  2. Data manipulation with pandas - DataCamp
