top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Pandas Data Manipulation Tools

Author: Rashidat Sikiru


Introduction

Pandas is an open-source python library for highly specialized data analysis. It is a perfect tool for anyone who wants to perform data analysis using python as a programming language. In this article, I will be taking us through the pandas techniques for data manipulation. There are several data manipulation tools, but this article will we be discussing five powerful pandas data manipulation tools. These are:

1. Merging

2. Mapping

3. Removing Duplicates

4. Binning

5. Sorting


1. Merging

Merging also referred to as joining consists of a combination of data frames through the connection of rows using one or more keys. It can be likened to the “JOIN” operation for those who are familiar with the Structured Query Language(SQL) and merge() is the function to perform this kind of operation.


Let’s import the pandas library and define two data frames that will serve as examples for this section:


# import pandas and numpy
import numpy as np
import pandas as pd

# define the first dataframe
data1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                      'price': [12.33,11.44,33.21,13.23,33.62]})
data1
   id     price 
0  ball    12.33
1  pencil  11.44
2  pen     33.21
3  mug     13.23
4  ashtray 33.62
# define the second dataframe
data2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'], 'color': ['white','red','red','black']})
data2

    id      color
0   pencil  white
1   pencil  red
2   ball    red
3   pen    black

Now we can carry out merging by applying the merge() function to the two data frame objects.

pd.merge(data1,data2)

    id     price    color
0   ball   12.33    red
1   pencil 11.44    white
2   pencil 11.44    red
3   pen    33.21    black

We can see from the result, the returned data frame consists of all rows that have an ID in common. In addition to the common column, the columns from the first and the second data frame are added. In this case, we used the merge() function without specifying any column explicitly.


However, we will come across cases where we need to decide which column on which to base the merging. To do this, we need to add the ON option with the column name as the key for the merging.


Let's do this by adding another column named "brand" to the two data frames so that we can have two similar columns on each data frame.

data1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],  
                      'color':  ['white','red','red','black','green'], 
                      'brand': ['OMG','ABC','ABC','POD','POD']})
data1
     id       color     brand
0    ball     white     OMG
1    pencil   red       ABC
2    pen      red       ABC
3    mug      black     POD
4    ashtray  green     POD
data2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],
                     'brand': ['OMG','POD','ABC','POD']})
data2
     id     brand
0  pencil   OMG
1  pencil   POD
2  ball     ABC
3  pen      POD

Now that we have two data frames having columns with the same name, let us use our merge() function.

pd.merge(data1,data2)
id color  brand

We can see that our code returns an empty data frame because it does not know which of the columns it should carry out the merging.


So let's try to explicitly define the criteria for merging by specifying the name of the key column in the ON option.

# merging using id
pd.merge(data1,data2,on='id')
   id    color   brand_x   brand_y
0 ball   white   OMG       ABC
1 pencil red     ABC       OMG
2 pencil red     ABC       POD
3 pen    red     ABC       POD
# merging using brand
pd.merge(data1, data2, on="brand")

id_xobjectcolorobjectbrandobjectid_yobject

0

ball

white

OMG

pencil

1

pencil

red

ABC

ball

2

pen

red

ABC

ball

3

mug

black

POD

pencil

4

mug

black

POD

pen

5

ashtray

green

POD

pencil

6

ashtray

green

POD

pen


0 comments

Recent Posts

See All

Comments


bottom of page