top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Detect and Impute Missing Values in Pandas

Hardly you will come across a dataset that is free from missing values. Missing values can be encoded in a number of ways. Sometimes when dealing with numeric data, missing data can be encoded with Numpy's NaN value. There are times when None or an empty string is used in place of a missing value. In this tutorial, we will first take a look at how to detect missing values and then impute them in the Data Frame. As usual, we will import Numpy and Pandas for this tutorial. Also, we will import statsmodels to be able to load the mtcars dataset. This is the dataset we will use for the tutorial.


import numpy as np
import pandas as pd
import statsmodels.api as sm

We will save the loaded dataset in Data Frame called df


df = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

Let's take a quick look at the Data Frame


df.head()

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2


Now that the Data Frame is all set, we need to check if it contains missing values.


df.isnull().sum().sum()
output: 0

This Data Frame does not have missing values. So we will manually modify it to contain some. We will have `NaN` and `''` in some cells and also create a new column that only has values equal to `None`.


df.iloc[0:4, 3:4] = np.NaN
df.iloc[1:5, 1:2] = ''
df['col_none'] = None

Let's now have a quick look at the head of the Data Frame.


df.head()

mpg cyl disp hp drat wt qsec vs am gear carb col_none

Mazda RX4 21.0 6 160.0 3.9 2.62 16.46 0 1 4 4

Mazda RX4 Wag 21.0 160.0 3.9 2.875 17.02 0 1 4 4

Datsun 710 22.8 108.0 3.85 2.32 18.61 1 1 4 1

Hornet 4 Drive 21.4 258.0 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 360.0 175.0 3.15 3.44 17.02 0 0 3 2


Now we see that their are empty strings in some cells, NaN values in some, and a column with only None values We will again inspect to see how many na values there are in the Data Frame after the modification.

df.isnull().sum().sum()
output: 36

We now see that the total is 36!

We will now create a boolean Data Frame for null values, null.


null = pd.isnull(df)

A quick peek at the new Data Frame will give us an idea about how this looks.


null.head()

mpg cyl disp hp drat wt qsec vs am gear carb col_none

Mazda RX4 False False False True False False False False False False False True

Mazda RX4 Wag False False False True False False False False False False False True

Datsun 710 False False False True False False False False False False False True

Hornet 4 Drive False False False True False False False False False False False True

Hornet Sportabout False False False False False False False False False False False True


We see that the cells with True values are the ones with either NaN or None. But The empty strings are marked False. This is because the isnull() doesn't work on empty strings.

Suppose we thoroughly checked our df and came to the conclusion that only '', NaN, and None encode missing values. We can then have these in a list and create a Data Frame for the missing values.


missing_vals = [np.NaN, '', None]
missing = df.isin(missing_vals)

Let's take a look at the head of the new Data Frame.


missing.head()

mpg cyl disp hp drat wt qsec vs am gear carb col_none

Mazda RX4 False False False True False False False False False False False True

Mazda RX4 Wag False True False True False False False False False False False True

Datsun 710 False True False True False False False False False False False True

Hornet 4 Drive False True False True False False False False False False False True

Hornet Sportabout False True False False False False False False False False False True


To impute missing NaNs and Nones, the fillna() is enough. Lets take a look.


df.fillna(0).head()

mpg cyl disp hp drat wt qsec vs am gear carb col_none

Mazda RX4 21.0 6 160.0 0.0 3.9 2.62 16.46 0 1 4 4 0

Mazda RX4 Wag 21.0 160.0 0.0 3.9 2.875 17.02 0 1 4 4 0

Datsun 710 22.8 108.0 0.0 3.85 2.32 18.61 1 1 4 1 0

Hornet 4 Drive 21.4 258.0 0.0 3.08 3.215 19.44 1 0 3 1 0

Hornet Sportabout 18.7 360.0 175.0 3.15 3.44 17.02 0 0 3 2 0



But to impute/fill missing values for all the occurrences of missing values including empty strings, we will use the pandas mask method. Here, we will replace all missing values with the string '90000'


df = df.mask(missing, '90000')

Let's now see the head of the Data Frame.


df.head()

mpg cyl disp hp drat wt qsec vs am gear carb col_none

Mazda RX4 21.0 6 160.0 90000 3.9 2.62 16.46 0 1 4 4 90000

Mazda RX4 Wag 21.0 90000 160.0 90000 3.9 2.875 17.02 0 1 4 4 90000

Datsun 710 22.8 90000 108.0 90000 3.85 2.32 18.61 1 1 4 1 90000

Hornet 4 Drive 21.4 90000 258.0 90000 3.08 3.215 19.44 1 0 3 1 90000

Hornet Sportabout 18.7 90000 360.0 175.0 3.15 3.44 17.02 0 0 3 2 90000


But to make sure there are no missing values, let's run the cell below to give us the count of missing values.


df.isnull().sum().sum()
0

Yaay! Mission accomplished!


0 comments

Recent Posts

See All

Kommentare


bottom of page