Important Pandas techniques you need to know as a Data role

Abdi Tolesa
Nov 20, 2021
3 min read

Updated: Nov 21, 2021

1. Crosstabs

Cross tabulation is a technique that allows you to summarize the data in categorical variables and examine it to determine if there are any associations present. When implemented in Pandas python library, it by default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

The syntax:

pandas.crosstab(parameters)

Out of the many parameters the function supports, the following are used to demonstrate how to use it.

columns : array-like, Series, or list of arrays/Series

Values to group by in the columns.

values : array-like, optional

Array of values to aggregate according to the factors. Requires aggfunc be specified.

aggfunc : function, optional

If specified, requires values be specified as well.

margins : bool, default False

Add row/column margins (subtotals).

margins_name : str, default ‘All’

Name of the row/column that will contain the totals when margins is True.

Returns : DataFrame

Cross tabulation of the data.

Let's first see the basic usage of panda's crosstab function. We begin by importing the dataset and wrangle it a bit to modify the headers and filter only records that come from the subset of car models we select. The following snippet does all that.

import pandas as pd

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("datasets/imports-85.data", header=None, names=headers, na_values="?" )

# Define a list of models that we want to review
models = ["toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"]

# Create a copy of the data with only the top 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()

Once we have the data imported, wrangled and filtered, its lets see how we can achieve a simple crosstab.

pd.crosstab(df.make, df.fuel_type)

The output of the above snippet is this:

The interpretation of the above code is the values reflect the count of all records with each car make having diesel/gas fuel type.

Let's enhance the result a bit to include an additional calculated field named 'Total' that aggregates the values per row and append it to the crosstab table. That's achievable by specifying the margins and margins_name params.

pd.crosstab(df.make, df.fuel_type, margins=True, margins_name="Total")

We can chance the aggregation function and the values aggregated if we wish using the values and aggfunc params.

pd.crosstab(df.make, df.body_style, values=df.curb_weight, aggfunc='mean').round(0)

To make the analysis even more interesting, we can group the columns as needed so that the result reflects aggregates based on the grouping we specify.

pd.crosstab(df.make, [df.body_style, df.drive_wheels])

2. Imputing missing values

Real world datasets are not perfect obviously and that means there are missing values here and there. Dealing with missing values is one of the foremost data preparation tasks required before we consider the dataset for analysis.

There are many different approaches that can be taken to fill in missing values with pandas and lets see them one by one.

Lets consider the following dataset which has missing values.

2.1) Filling with custom value with DataFrame.fillna() method

df.fillna(0)

Output:

2.2) Fill gaps forward or backward

The following code illustrates filling missing values from the leading rows.

df.fillna(method="bfill")

Output:

2.3) Filling with computed values

It is also possible to compute the missing values from the known observations as follow.

df.fillna(df.mean()["two":"three"])

2.4) Drop dataset labels that have missing values

df.dropna(axis=1)

Output:

2.5) Interpolation

Both Series and DataFrame objects have interpolate() that, by default, performs linear interpolation at missing data points.

df[["one", "two"]].interpolate()

3. 10 very useful summary (aggregation) functions

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of group
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values

4. Groupwise analysis

It is a common practice to perform different data analytics operations on a group of data instead of at individual row level. Pandas offer the groupby() and many other methods to accomplish such tasks.

The following example illustrates how we can count how many of each car make is in the dataset used above.