top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Pandas techniques - part 3 - Merging Data Frames

Due to the development of the relational database management system (RDMS), combining and merging data has become an essential programming work in data analysis.

What is the relational database management system?

It is a common type of database that stores data in separate tables, and it can be used in relation to other stored datasets. For example, suppose a database for a company may contain information about the employees, salaries, departments, dates of each employee, the rank of each employee, and so on. This table will have too many columns and will have difficult to access, slow speed, low accuracy. Here RDMS comes into play. It separates into tables with columns of the same types. For example, employee table, department table, salary table. Each table has a primary key and may or may not have foreign keys. The primary key is a specific column for that table and it contains a unique value for each row and represents that table. A foreign key is a primary key from another table. For example, employee ID is the primary key for the employee table and department ID is the primary key for the department table but department ID is a foreign key for the employee table. We can use those keys to connect between tables. I am just talking about the concepts of RDMS and I will not focus on those keys. I will show the idea of how we can merge tables in python with Pandas.


Merging DataFrames


The main idea of merging tables is how to merge.


Suppose we want to combine two tables, the key columns from both tables serve as the pivot column. Depends on the values from these columns, the two tables will merge. There are four types of joins in general.


Join/Inner join - only the same values from each key columns

Full/Outer join - all the values from each key column

Left join - all the values from the left table

Right join - all the values from the right table


Let's see in action in Pandas.


Inner Join

df1 = pd.DataFrame({'key':['a','a','b','b','f','f'],
                   'data':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'key':['a','b','c','d','e'],
                   'data':[10,20,30,40,50]})
df1
Out[2]: 
  key  data
0   a     1
1   a     2
2   b     3
3   b     4
4   f     5
5   f     6
In [3]: df2
Out[3]: 
  key  data
0   a    10
1   b    20
2   c    30
3   d    40
4   e    50

We can use the merge method to combine, the default is the inner join.

In [4]: pd.merge(df1, df2, on='key')
Out[4]: 
  key  data_x  data_y
0   a       1      10
1   a       2      10
2   b       3      20
3   b       4      20

Look at the key column and compare it with the tables before combining. It is combined with the key values that are contained in both tables.

Left join

In [6]: df1 = pd.DataFrame({'key':['a','a','b','b','f','f'],
   ...:                    'data':[1,2,3,4,5,6]})
   ...: df2 = pd.DataFrame({'key':['a','b','c','d','e'],
   ...:                    'data':[10,20,30,40,50]})
   ...: df1
Out[6]: 
  key  data
0   a     1
1   a     2
2   b     3
3   b     4
4   f     5
5   f     6
In [7]: df2
Out[7]: 
  key  data
0   a    10
1   b    20
2   c    30
3   d    40
4   e    50

For left join, we can pass the parameter, how with left.

In [8]: pd.merge(df1, df2, on='key', how='left')
Out[8]: 
  key  data_x  data_y
0   a       1    10.0
1   a       2    10.0
2   b       3    20.0
3   b       4    20.0
4   f       5     NaN
5   f       6     NaN

It focuses on the key value of table 1 and so, the values from table 2 become missing values as table 2 does not contain key values of 'f'.

The Right join is the same as the left join. It will get the same result if you pass the parameters df2 first and df1 second.


Outer Join

The outer join will combine all the key values from both tables.

In [9]: pd.merge(df1, df2, on='key', how='outer')
Out[9]: 
  key  data_x  data_y
0   a     1.0    10.0
1   a     2.0    10.0
2   b     3.0    20.0
3   b     4.0    20.0
4   f     5.0     NaN
5   f     6.0     NaN
6   c     NaN    30.0
7   d     NaN    40.0
8   e     NaN    50.0


Different key column names


If the two tables have different key column names, you can use the parameters left_on and right_on to merge.

In [10]: df1 = pd.DataFrame(
              {'key1':['a','a','b','b','f','f'],
               'data':[1,2,3,4,5,6]})
    ...: df2 = pd.DataFrame(
              {'key2':['a','b','c','d','e'],
    ...:       'data':[10,20,30,40,50]})
    ...: pd.merge(df1, df2, left_on='key1', 
                  right_on='key2')

Out[10]: 
  key1  data_x key2  data_y
0    a       1    a      10
1    a       2    a      10
2    b       3    b      20
3    b       4    b      20

More than one key column


If the two tables have more than one key column, you can pass the on parameter with a list of key column names.

In [11]: df1 = pd.DataFrame(
                      {'key1':['aaa','aaa','bbb'],
    ...:               'key2':['one','three','one'],
    ...:               'lval':[1,2,3]})
    ...: df2 = pd.DataFrame(
                   {'key1':['aaa','bbb','aaa','bbb'],
    ...:            'key2':['one','one','two','three'],
    ...:            'rval':[10,20,30,40]})
    ...: df1
Out[11]: 
  key1   key2  lval
0  aaa    one     1
1  aaa  three     2
2  bbb    one     3

In [12]: df2
Out[12]: 
  key1   key2  rval
0  aaa    one    10
1  bbb    one    20
2  aaa    two    30
3  bbb  three    40
In [13]: pd.merge(df1, df2, on=['key1','key2'], 
                                   how='outer')
Out[13]: 
  key1   key2  lval  rval
0  aaa    one   1.0  10.0
1  aaa  three   2.0   NaN
2  bbb    one   3.0  20.0
3  aaa    two   NaN  30.0
4  bbb  three   NaN  40.0

Using Suffixes


You can use suffixes for defining custom column names after merging the two tables. For example, the default name is x and y.

In [14]: pd.merge(df1, df2, on='key1')
Out[14]: 
  key1 key2_x  lval key2_y  rval
0  aaa    one     1    one    10
1  aaa    one     1    two    30
2  aaa  three     2    one    10
3  aaa  three     2    two    30
4  bbb    one     3    one    20
5  bbb    one     3  three    40

You can pass the suffixes parameter to define the column names.

In [15]: pd.merge(df1, df2, on='key1', suffixes= 
                              ('_left','_right'))
Out[15]: 
  key1 key2_left  lval key2_right  rval
0  aaa       one     1        one    10
1  aaa       one     1        two    30
2  aaa     three     2        one    10
3  aaa     three     2        two    30
4  bbb       one     3        one    20
5  bbb       one     3      three    40


Merging on Index


You can also merge on the index by passing right_index or left_index with boolean values.



In [16]: df1 = pd.DataFrame(
                   {'key':['a','b','c','a','b','f'],
    ...:            'value':range(6)})
    ...: df2 = pd.DataFrame({'value':[10,20]}, 
                              index=['a','b'])
    ...: df1
Out[16]: 
  key  value
0   a      0
1   b      1
2   c      2
3   a      3
4   b      4
5   f      5

In [17]: df2
Out[17]: 
   value
a     10
b     20

In [18]: pd.merge(df1, df2, left_on='key',right_index=True)
Out[18]: 
  key  value_x  value_y
0   a        0       10
3   a        3       10
1   b        1       20
4   b        4       20


Hierarchically-index data


We can also use Hierarchically-index data to join with the same methods.

In [19]: df1 = pd.DataFrame(
          {'key1':['A','A','B','B','D','D'],
    ...:  'key2':['aaa','bbb','aaa','bbb','ddd','eee'],
    ...:  'data':np.arange(6)})
    ...: df2 = pd.DataFrame(np.arange(12) 
                              .reshape((6,2)),
    ...:            index=[['A','A','B','B','B','B'],
    ...:        ['aaa','ccc','aaa','bbb','ccc','ddd']],
    ...:            columns=['val1','val2'])
    ...: df
Out[19]: 
  key1 key2  data
0    A  aaa     0
1    A  bbb     1
2    B  aaa     2
3    B  bbb     3
4    D  ddd     4
5    D  eee     5

In [20]: df2
Out[20]: 
       val1  val2
A aaa     0     1
  ccc     2     3
B aaa     4     5
  bbb     6     7
  ccc     8     9
  ddd    10    11
  
In [21]: pd.merge(df1, df2, left_on=['key1','key2'], 
                                 right_index=True)
Out[21]: 
  key1 key2  data  val1  val2
0    A  aaa     0     0     1
2    B  aaa     2     4     5
3    B  bbb     3     6     7

Concatenation tables


Not same as above merging methods, table concatenation is just to combine rows and columns without using key columns. We can use pd.concat method and passing the list of tables that want to be combined and specify the axis ( 0 for bind rows, and 1 for bind columns ).

In [22]: df1 = pd.DataFrame({'Name':['A','B','C','D'],
    ...:                    'Value':np.arange(4)})
    ...: df2 = pd.DataFrame({'Name':['E','F','G'],
    ...:                    'Value':np.arange(3)})
    ...: df1
Out[22]: 
  Name  Value
0    A      0
1    B      1
2    C      2
3    D      3

In [23]: df2
Out[23]: 
  Name  Value
0    E      0
1    F      1
2    G      2

In [24]: pd.concat([df1,df2],axis=0)
Out[24]: 
  Name  Value
0    A      0
1    B      1
2    C      2
3    D      3
0    E      0
1    F      1
2    G      2

In [25]: pd.concat([df1,df2],axis=1)
Out[25]: 
  Name  Value Name  Value
0    A      0    E    0.0
1    B      1    F    1.0
2    C      2    G    2.0
3    D      3  NaN    NaN

Combining data with Overlapping


Start with an example.

In [26]: df1 = pd.DataFrame({'a':[1,np.nan,3,4,np.nan],
    ...:                    'b':[6,7,np.nan,8,9],
    ...:                    'c':[10,11,12,np.nan,14]})
    ...: df2 = pd.DataFrame({'a':[10,20,30,np.nan,50],
    ...:            'b':[60,70,80,np.nan, np.nan],
    ...:            'c':[100,np.nan, 130,np.nan, 150]})
    ...: df1
Out[26]: 
     a    b     c
0  1.0  6.0  10.0
1  NaN  7.0  11.0
2  3.0  NaN  12.0
3  4.0  8.0   NaN
4  NaN  9.0  14.0

In [27]: df2
Out[27]: 
      a     b      c
0  10.0  60.0  100.0
1  20.0  70.0    NaN
2  30.0  80.0  130.0
3   NaN   NaN    NaN
4  50.0   NaN  150.0

Look at the NaN values. The idea is to get all the values from table 1 and if there are missing values in table 1, take that value from table 2. We can use the combine_first method.

In [28]: df1.combine_first(df2)
Out[28]: 
      a     b     c
0   1.0   6.0  10.0
1  20.0   7.0  11.0
2   3.0  80.0  12.0
3   4.0   8.0   NaN
4  50.0   9.0  14.0

To be clear you can use the where method from NumPy.

In [29]: np.where(pd.isna(df1),df2,df1)
Out[29]: 
array([[ 1.,  6., 10.],
       [20.,  7., 11.],
       [ 3., 80., 12.],
       [ 4.,  8., nan],
       [50.,  9., 14.]])

If there is a missing value in df1, then take from df2, and if not missing, take the value from df1.


This is an introduction to how to combine and merge data frames in Pandas. Only one important idea is how they merge and join; inner, outer, left, and right.


0 comments

Recent Posts

See All

Comments


bottom of page