Cross tabulation is a technique that allows you to summarize the data in categorical variables and examine it to determine if there are any associations present. When implemented in Pandas python library, it by default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
Out of the many parameters the function supports, the following are used to demonstrate how to use it.
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns.
values : array-like, optional
Array of values to aggregate according to the factors. Requires aggfunc be specified.
aggfunc : function, optional
If specified, requires values be specified as well.
margins : bool, default False
Add row/column margins (subtotals).
margins_name : str, default ‘All’
Name of the row/column that will contain the totals when margins is True.
Returns : DataFrame
Cross tabulation of the data.
Let's first see the basic usage of panda's crosstab function. We begin by importing the dataset and wrangle it a bit to modify the headers and filter only records that come from the subset of car models we select. The following snippet does all that.
import pandas as pd # Define the headers since the data does not have any headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base", "length", "width", "height", "curb_weight", "engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"] # Read in the CSV file and convert "?" to NaN df_raw = pd.read_csv("datasets/imports-85.data", header=None, names=headers, na_values="?" ) # Define a list of models that we want to review models = ["toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"] # Create a copy of the data with only the top 8 manufacturers df = df_raw[df_raw.make.isin(models)].copy()
Once we have the data imported, wrangled and filtered, its lets see how we can achieve a simple crosstab.
The output of the above snippet is this:
The interpretation of the above code is the values reflect the count of all records with each car make having diesel/gas fuel type.
Let's enhance the result a bit to include an additional calculated field named 'Total' that aggregates the values per row and append it to the crosstab table. That's achievable by specifying the margins and margins_name params.
pd.crosstab(df.make, df.fuel_type, margins=True, margins_name="Total")
We can chance the aggregation function and the values aggregated if we wish using the values and aggfunc params.
pd.crosstab(df.make, df.body_style, values=df.curb_weight, aggfunc='mean').round(0)
To make the analysis even more interesting, we can group the columns as needed so that the result reflects aggregates based on the grouping we specify.
pd.crosstab(df.make, [df.body_style, df.drive_wheels])
2. Imputing missing values
Real world datasets are not perfect obviously and that means there are missing values here and there. Dealing with missing values is one of the foremost data preparation tasks required before we consider the dataset for analysis.
There are many different approaches that can be taken to fill in missing values with pandas and lets see them one by one.
Lets consider the following dataset which has missing values.
2.1) Filling with custom value with DataFrame.fillna() method
2.2) Fill gaps forward or backward
The following code illustrates filling missing values from the leading rows.
2.3) Filling with computed values
It is also possible to compute the missing values from the known observations as follow.
2.4) Drop dataset labels that have missing values
Both Series and DataFrame objects have interpolate() that, by default, performs linear interpolation at missing data points.
3. 10 very useful summary (aggregation) functions
mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of group
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values
4. Groupwise analysis
It is a common practice to perform different data analytics operations on a group of data instead of at individual row level. Pandas offer the groupby() and many other methods to accomplish such tasks.
The following example illustrates how we can count how many of each car make is in the dataset used above.
The same result could also simply be achieved with one method value_counts().
Furthermore, we can use aggregate functions like min(),max(), sum() etc on the grouped data as follow.
5. Selecting a subset of data from a dataset
Index based selection with iloc
To select the first five rows of the dataset, we can use the following function.
To target the first two columns in the result, another range can be specified like this.
Label based selection with loc.
We can on the other hand use the loc selector along with a condition to apply as a criteria for selecting the subset we need.
df.iloc[df.price > 10000]