

Pandas techniques - part 2 - groupby

After loading, merging, and preparing a data set, the next important step is usually to compute group statistics. Pandas provides a high-performance groupby facility that lets you slice, dice, and summarize data sets in a natural way.

Hadley Wickham, the author of many popular packages for the R programming language, described group operations as split-apply-combine. In the first stage, the data is split into groups based on one or more keys, along a particular axis of the object (axis=0 for rows, axis=1 for columns). Then a function is applied to each group, producing a new value. Finally, the results are combined into a result object. These are the groupby mechanics.
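Before turning to pandas, here is a minimal sketch of split-apply-combine done by hand with plain Python dicts (illustration only, not how pandas implements it):

```python
# Split-apply-combine by hand: (key, value) records.
data = [('a', 1), ('a', 5), ('b', 2), ('b', 4)]

# Split: collect values into groups based on the key.
groups = {}
for key, value in data:
    groups.setdefault(key, []).append(value)

# Apply: run a function (here, sum) on each group.
applied = {key: sum(values) for key, values in groups.items()}

# Combine: the per-group results form the final object.
print(applied)  # {'a': 6, 'b': 6}
```

Pandas performs these three stages for you when you call groupby followed by an aggregation.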


Groupby basic operations


Basic syntax is

df.groupby(column/s_to_group)[[column/s_to_calculate]].function/s_to_compute( )

Let's get started. Here is a very simple data frame.

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':['a','a','b','b','a','b'],
                   'key2':['one','two','one','two','one','two'],
                   'val1':np.random.permutation(6),
                   'val2':np.random.permutation(6)})
df
Out[3]: 
  key1 key2  val1  val2
0    a  one     5     2
1    a  two     1     5
2    b  one     2     1
3    b  two     0     4
4    a  one     4     3
5    b  two     3     0
grouped = df.groupby('key1')
grouped
Out[4]: 
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f880385dee0>

Pandas returns a GroupBy object; at this stage nothing has been computed yet. To compute group means, for example, we can call the mean method.

grouped.mean()
Out[5]: 
          val1      val2
key1                    
a     3.333333  3.333333
b     1.666667  1.666667

This calculates the average of 'val1' and 'val2' for each group in the 'key1' column.
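A GroupBy object is also iterable, which is handy for inspecting the split stage directly. Here is a small self-contained sketch (re-creating the example frame with fixed values instead of random permutations):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b'],
                   'val1': [5, 1, 2, 0, 4, 3]})

# Iterating over a GroupBy object yields (group name, sub-DataFrame) pairs.
sizes = {}
for name, group in df.groupby('key1'):
    sizes[name] = len(group)

print(sizes)  # {'a': 3, 'b': 3}
```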

We can group one or more columns.

df.groupby(['key1','key2'])['val1'].mean()

The data is grouped by the 'key1' and 'key2' columns, and the average of the 'val1' column is calculated for each group.

Out[6]: 
key1  key2
a     one     4.5
      two     1.0
b     one     2.0
      two     1.5
Name: val1, dtype: float64

The following code produces the same result. Note that when grouping a Series, you pass the key Series themselves rather than column names.

df['val1'].groupby([df['key1'], df['key2']]).mean()

You can use the unstack method to spread the inner index level out as columns.

df.groupby(['key1','key2'])['val1'].mean().unstack()
Out[7]: 
key2  one  two
key1          
a     4.5  1.0
b     2.0  1.5

You can use one or more keys and one or more values to calculate.

df.groupby(['key1','key2'])[['val1', 'val2']].mean()
Out[8]: 
           val1  val2
key1 key2            
a    one    4.5   2.5
     two    1.0   5.0
b    one    2.0   1.0
     two    1.5   2.0

The following statistical methods are available as optimized groupby methods.

Function Name    Description
count            Number of non-NA values in the group
sum              Sum of non-NA values
mean             Mean of non-NA values
median           Median of non-NA values
std, var         Sample standard deviation and variance of non-NA values
min, max         Minimum and maximum of non-NA values
prod             Product of non-NA values
first, last      First and last non-NA values
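As a quick illustration of the "non-NA" behavior in this table, here is a small sketch on a frame with a missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 3.0, 4.0]})
g = df.groupby('key')['val']

print(g.count())  # non-NA count per group: a -> 1, b -> 2
print(g.sum())    # NaN values are skipped: a -> 1.0, b -> 7.0
print(g.first())  # first non-NA value: a -> 1.0, b -> 3.0
```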

Groupby with agg method


You can also perform these statistical computations with the agg method. With agg, you can pass any built-in method name as a string, a NumPy function, or a custom function; note that custom functions are generally much slower than the optimized built-in methods. For example,

df.groupby(['key1','key2']).agg('mean')

This produces the same averages as the method above. You can pass one or more functions to the agg method.

df.groupby(['key1'])['val1'].agg(['mean','sum'])

It is grouped by the 'key1' column to calculate the average and sum of the 'val1' column.

Out[9]: 
          mean  sum
key1               
a     3.333333   10
b     1.666667    5

One thing to be aware of: built-in function names are passed as strings in quotes, while NumPy functions and custom functions are passed as bare names. Here we create a new function to calculate the range of the data.

def range_(arr):
    return max(arr) - min(arr)
df.groupby(['key1'])[['val1']].agg(['mean', 
                                   'std',range_])
Out[11]: 
          val1                 
          mean       std range_
key1                           
a     3.333333  2.081666      4
b     1.666667  1.527525      3

You can change the column names of the result table by adding a list of tuples with custom names.

df.groupby(['key1'])[['val1']].agg([('Average','mean'), 
                         ('Standard Deviation',np.std),
                         ('Range',range_)])
Out[13]: 
          val1                              
       Average  Standard Deviation  Range
key1                                       
a     3.333333            2.081666      4
b     1.666667            1.527525      3
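Newer pandas versions (0.25 and later) also support "named aggregation", which produces custom column names directly via keyword arguments. A small sketch on the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b'],
                   'val1': [5, 1, 2, 0, 4, 3]})

# Named aggregation: keyword = (column, function).
result = df.groupby('key1').agg(
    Average=('val1', 'mean'),
    Range=('val1', lambda s: s.max() - s.min()),
)
print(result)
```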

If you want to apply different functions to one or more of the columns, you can pass a dict to agg that contains a mapping of column names to any of the function specifications.

df.groupby(['key1','key2'])[['val1','val2']].agg(
                         {'val1':['min','max','mean'],
                          'val2':['std','sum']})

The data is grouped by both 'key1' and 'key2'; the minimum, maximum, and mean are calculated for the 'val1' column, and the standard deviation and sum for the 'val2' column.

Out[14]: 
          val1               val2    
           min max mean       std sum
key1 key2                            
a    one     4   5  4.5  0.707107   5
     two     1   1  1.0       NaN   5
b    one     2   2  2.0       NaN   1
     two     0   3  1.5  2.828427   4


In all of the examples above, the aggregated data comes back with an index, and if you do not want this index, you can disable it by passing as_index=False to groupby:

df.groupby(['key1','key2'], as_index=False).mean()
Out[15]: 
  key1 key2  val1  val2
0    a  one   4.5   2.5
1    a  two   1.0   5.0
2    b  one   2.0   1.0
3    b  two   1.5   2.0
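Equivalently, you can keep the default index and call reset_index on the result afterwards; a small sketch showing both produce the same flat table:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'val1': [1, 3, 2]})

# Option 1: disable the group index up front.
flat1 = df.groupby('key1', as_index=False)['val1'].mean()

# Option 2: aggregate normally, then move the index back to a column.
flat2 = df.groupby('key1')['val1'].mean().reset_index()

print(flat1.equals(flat2))  # True
```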

Aggregation is only one kind of group operation. It reduces a one-dimensional array to a scalar value. There are other kinds of group operations.


Groupby with transform( )


The transform method applies a function to each group, then places the results in the appropriate locations. Let's see it with an example.

Review the data frame.

Out[16]: 
  key1 key2  val1  val2
0    a  one     5     2
1    a  two     1     5
2    b  one     2     1
3    b  two     0     4
4    a  one     4     3
5    b  two     3     0

Let's say we want to calculate the summation of 'val1' and 'val2' grouped by 'key1'.

df.groupby(['key1'])[['val1','val2']].sum()
Out[17]: 
      val1  val2
key1            
a       10    10
b        5     5

If we use the transform method, these values will be broadcast back to the appropriate locations.

df.groupby(['key1'])[['val1','val2']].transform(np.sum)
Out[18]: 
   val1  val2
0    10    10
1    10    10
2     5     5
3     5     5
4    10    10
5     5     5
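Because transform returns a result aligned with the original rows, a common use is per-group normalization. A small sketch, subtracting each group's mean from its members ("demeaning"):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'val1': [1.0, 3.0, 10.0, 30.0]})

# transform('mean') broadcasts each group's mean back to the original rows,
# so the subtraction lines up element by element.
demeaned = df['val1'] - df.groupby('key1')['val1'].transform('mean')
print(demeaned.tolist())  # [-1.0, 1.0, -10.0, 10.0]
```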

Groupby with apply( )


Like aggregate and transform, the apply method splits the object into pieces, applies the passed function to each piece, and then combines the pieces together again. Let's create a new data frame.

 df = pd.DataFrame({'data1':np.random.randn(1000),
                  'data2':np.random.randn(1000)})
category = pd.cut(df['data1'], 4)

The data frame contains two columns and 1000 rows of random numbers. The category variable holds categorical data: 4 equal-width bins derived from the 'data1' column.

category.head()
Out[20]: 
0    (-1.542, -0.111]
1    (-1.542, -0.111]
2    (-1.542, -0.111]
3    (-1.542, -0.111]
4     (-0.111, 1.321]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.979, -1.542] < (-1.542, -0.111] < (-0.111, 1.321] < (1.321, 2.752]]

First, define the statistical functions to apply.

def get_stats(arr):
    return {'Minimum':arr.min(),'Maximum':arr.max(),
           'Average':arr.mean(),'Count':arr.count()}

You can easily pass this function inside the apply method.

df['data1'].groupby([category]).apply(get_stats).unstack()

Out[22]: 
                   Minimum   Maximum   Average  Count
data1                                                
(-2.979, -1.542] -2.973551 -1.553749 -1.919661   61.0
(-1.542, -0.111] -1.542005 -0.113647 -0.680796  388.0
(-0.111, 1.321]  -0.110328  1.307575  0.518177  463.0
(1.321, 2.752]    1.329699  2.752306  1.797560   88.0

You can use the qcut method to cut the data into bins with the same number of observations in each group. Here is an example creating 4 groups.

category = pd.qcut(df['data1'], 4, labels=False)
category.head()
Out[23]: 
0    1
1    0
2    1
3    0
4    3
Name: data1, dtype: int64
df['data1'].groupby([category]).apply(get_stats).unstack()
Out[24]: 
        Minimum   Maximum   Average  Count
data1                                     
0     -2.973551 -0.630337 -1.234221  250.0
1     -0.629076  0.034398 -0.297647  250.0
2      0.037742  0.684484  0.341830  250.0
3      0.687052  2.752306  1.257450  250.0

When cleaning up missing data, you can impute the missing values using a fixed value or some value you define. Let's create a data frame with missing values.

df = pd.DataFrame({'key': ['a','a','b','b','a','b'],
                  'value':[3,np.nan,4,np.nan,1,2]})
df
Out[25]: 
  key  value
0   a    3.0
1   a    NaN
2   b    4.0
3   b    NaN
4   a    1.0
5   b    2.0

Suppose you want to fill the missing values with the average value of each group according to the 'key' column.

df.groupby(['key']).mean()
Out[26]: 
     value
key       
a      2.0
b      3.0

These are the group averages. To fill in the missing values with them, you can use a lambda function inside the apply method.

df.groupby(['key'], as_index=False).apply(lambda x: 
                                 x.fillna(x.mean()))
Out[27]: 
    key  value
0 0   a    3.0
  1   a    2.0
  4   a    1.0
1 2   b    4.0
  3   b    3.0
  5   b    2.0

You can also fill the missing values with the values you define.

fill_values = {'a':10,'b':20}
df.groupby(['key'], as_index=False).apply(lambda x: 
                      x.fillna(fill_values[x.name]))
Out[28]: 
  key  value
0   a    3.0
1   a   10.0
2   b    4.0
3   b   20.0
4   a    1.0
5   b    2.0
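The same group-wise fill can be written with transform instead of apply; a small sketch, since transform('mean') broadcasts each group's mean back to the original rows, where fillna can use it directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a', 'b'],
                   'value': [3, np.nan, 4, np.nan, 1, 2]})

# Each group's mean, aligned with the original rows, fills that group's NaNs.
filled = df['value'].fillna(df.groupby('key')['value'].transform('mean'))
print(filled.tolist())  # [3.0, 2.0, 4.0, 3.0, 1.0, 2.0]
```

This version also preserves the original row order without any index juggling.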

There are many more ways to use the groupby method. Remember that a groupby operation is split-apply-combine, and the basic syntax is

df.groupby(column/s_to_group)[[column/s_to_calculate]].function/s_to_compute( )

Pivot-table


A pivot table is a data summarization tool commonly found in spreadsheet programs. The main idea of the pivot table is the same as the groupby method.

Its most commonly used parameters are as follows:


index - column/s to group

values - column/s to calculate

columns - same as the index but these results will show as column values in the result table

aggfunc - function/s to compute

margins - add row/column subtotals and grand totals

First, create a data set.


In [30]: np.random.seed(42)
    ...: df = pd.DataFrame({
    ...:     'sex': np.random.choice(['M','F'], size=10),
    ...:     'day': ['Mon','Tue','Wed','Thu','Fri']*2,
    ...:     'size': np.random.choice(['S','M','L'], size=10),
    ...:     'value1': np.arange(10,101,10),
    ...:     'value2': np.random.randint(10,20,size=10)})
    ...: df
Out[30]: 
  sex  day size  value1  value2
0   M  Mon    L      10      15
1   F  Tue    L      20      11
2   M  Wed    L      30      14
3   M  Thu    L      40      10
4   M  Fri    S      50      19
5   F  Mon    L      60      15
6   M  Tue    M      70      18
7   M  Wed    S      80      10
8   M  Thu    M      90      19
9   F  Fri    M     100      12
In [31]: df.pivot_table(index=['sex'], 
                        values=['value2'])
Out[31]: 
        value2
sex           
F    12.666667
M    15.000000

The data is grouped by the 'sex' column, and the average of the 'value2' column is calculated.

We can use more than one column.

In [32]: df.pivot_table(index=['sex', 'size'], 
                        values = ['value1','value2'])
Out[32]: 
              value1  value2
sex size                    
F   L      40.000000    13.0
    M     100.000000    12.0
M   L      26.666667    13.0
    M      80.000000    18.5
    S      65.000000    14.5

We can specify the function by the aggfunc parameter.

In [33]: df.pivot_table(index=['sex', 'size'], 
                        values = ['value1','value2'], 
                        aggfunc=np.sum)
Out[33]: 
          value1  value2
sex size                
F   L         80      26
    M        100      12
M   L         80      39
    M        160      37
    S        130      29

To add a grand total, we can set the margins parameter to True.


In [34]: df.pivot_table(index=['sex', 'size'], 
                        values = ['value1','value2'],
    ...:                aggfunc=np.sum, margins=True)
Out[34]: 
          value1  value2
sex size                
F   L         80      26
    M        100      12
M   L         80      39
    M        160      37
    S        130      29
All          550     143

We can specify different functions for each value column.


In [35]: df.pivot_table(index=['sex', 'size'], 
                values = ['value1','value2'],
    ...:        aggfunc = {'value1':[np.mean, np.min],
    ...:                   'value2':[np.max, np.std]})
Out[35]: 
         value1             value2          
           amin        mean   amax       std
sex size                                    
F   L      20.0   40.000000   15.0  2.828427
    M     100.0  100.000000   12.0       NaN
M   L      10.0   26.666667   15.0  2.645751
    M      70.0   80.000000   19.0  0.707107
    S      50.0   65.000000   19.0  6.363961

Spreading the data out with the columns parameter can introduce missing values; these can be replaced via the fill_value parameter. For example,


In [36]: df.pivot_table(index=['sex', 'size'], 
                        values = ['value2'], 
                        columns=['day'])
Out[36]: 
         value2                        
day         Fri   Mon   Thu   Tue   Wed
sex size                               
F   L       NaN  15.0   NaN  11.0   NaN
    M      12.0   NaN   NaN   NaN   NaN
M   L       NaN  15.0  10.0   NaN  14.0
    M       NaN   NaN  19.0  18.0   NaN
    S      19.0   NaN   NaN   NaN  10.0

We can fill these values with any scalar value we want.

In [37]: df.pivot_table(index=['sex', 'size'], 
                        values = ['value2'], 
                        columns=['day'], 
                        fill_value='miss')
Out[37]: 
         value2                        
day         Fri   Mon   Thu   Tue   Wed
sex size                               
F   L      miss  15.0  miss  11.0  miss
    M      12.0  miss  miss  miss  miss
M   L      miss  15.0  10.0  miss  14.0
    M      miss  miss  19.0  18.0  miss
    S      19.0  miss  miss  miss  10.0
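A closely related tool is pd.crosstab, which is essentially a pivot table specialized for counting group frequencies. A small sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['M', 'F', 'M', 'M', 'F'],
                   'size': ['L', 'L', 'S', 'L', 'M']})

# crosstab counts how often each (sex, size) combination occurs;
# margins=True adds an 'All' row and column with the totals.
ct = pd.crosstab(df['sex'], df['size'], margins=True)
print(ct)
```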

I think these methods will be useful for you. Thanks a lot for your time.



