# Cleaning Data in Python

## Data cleansing

Data cleansing is the process of detecting and correcting (or removing) untrustworthy, inaccurate, or outdated records from a data set, archive, table, or database.

You'll learn techniques to find and clean:

Missing Data

Irregular Data (Outliers)

Unnecessary Data — Repetitive Data, Duplicates and more

We use the California Housing Prices dataset from Kaggle.

About this file

longitude: A measure of how far west a house is; a higher value is farther west

latitude: A measure of how far north a house is; a higher value is farther north

housingMedianAge: Median age of a house within a block; a lower number is a newer building

totalRooms: Total number of rooms within a block

totalBedrooms: Total number of bedrooms within a block

population: Total number of people residing within a block

households: Total number of households, a group of people residing within a home unit, for a block

medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

medianHouseValue: Median house value for households within a block (measured in US Dollars)

oceanProximity: Location of the house w.r.t ocean/sea

```
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
```

```
df = pd.read_csv("california-housing-prices-housing.csv")
print(df.shape)
print(df.dtypes)
df.head()
```

```
# select numeric columns
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)
```

```
['longitude' 'latitude' 'housing_median_age' 'total_rooms'
'total_bedrooms' 'population' 'households' 'median_income'
'median_house_value']
```

```
# select non numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)
```

`['ocean_proximity']`

## Missing Data Heatmap

When there is a small number of features, we can visualize the missing data via a heatmap.

```
# specify the colours - yellow is missing, blue is not missing
colours = ['#000099', '#ffff00']
sns.heatmap(df.isnull(), cmap=sns.color_palette(colours))
```

The total_bedrooms feature has only a small number of missing values scattered across the 20,640 rows.

`print(df.isnull().sum())`

```
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
```
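Beyond raw counts, the share of missing values per column helps decide between dropping and imputing (here, 207 of 20,640 is about 1%). A minimal sketch on a made-up toy frame:

```
import pandas as pd
import numpy as np

# toy frame with one missing value in column 'a' (values are illustrative)
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': [1, 2, 3, 4]})

# isnull().mean() gives the fraction of missing values per column
pct_missing = toy.isnull().mean() * 100
print(pct_missing)  # a: 25.0, b: 0.0
```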

Fill the missing values in total_bedrooms with the column mean.

`df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())`
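As a quick sanity check, mean imputation can be demonstrated on a tiny made-up frame (the column name mirrors the dataset; the values are illustrative):

```
import pandas as pd
import numpy as np

# toy frame with one missing bedroom count (values are made up)
toy = pd.DataFrame({'total_bedrooms': [2.0, 4.0, np.nan, 6.0]})

# fill the gap with the column mean, as done above: (2 + 4 + 6) / 3 = 4.0
toy['total_bedrooms'] = toy['total_bedrooms'].fillna(toy['total_bedrooms'].mean())

print(toy['total_bedrooms'].tolist())  # [2.0, 4.0, 4.0, 6.0]
```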

## Irregular data (Outliers)

Outliers are observations that are distinctly different from the rest of the data. They could be genuine extreme values or mistakes.

```
# box plot.
df.boxplot(column=['median_house_value'])
```

`df['median_house_value'].describe()`

```
count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000
Name: median_house_value, dtype: float64
```
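The box plot above flags points beyond its whiskers; one common rule behind those whiskers (an aside, not something the article applies) marks values outside 1.5×IQR from the quartiles as outliers. A sketch on a few illustrative house values:

```
import pandas as pd

# illustrative values; the last one sits far above the rest
s = pd.Series([119600.0, 179700.0, 264725.0, 150000.0, 500001.0])

# box-plot whisker rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers.tolist())  # [500001.0]
```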

The maximum value of 500001 looks like an outlier.

`df[df['median_house_value'] == 500001]['median_house_value'].count()`

`965`

`print((965/ 20640)*100)`

`4.675387596899225`

About 4.7% of the rows have the value 500001. We could remove them, but this percentage is large enough that removing it could affect the analysis, so we keep them.

## Unnecessary Data — Repetitive Data, Duplicates and more

As an example of unnecessary data, we remove the households column.

`df = df.drop(columns=['households'])`
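Duplicates are another kind of unnecessary data; the article mentions them without showing code. A minimal sketch using pandas' `duplicated` and `drop_duplicates` on a made-up toy frame:

```
import pandas as pd

# toy frame with one exact duplicate row (values are illustrative)
toy = pd.DataFrame({'longitude': [-122.23, -122.23, -122.22],
                    'latitude': [37.88, 37.88, 37.86]})

# flag fully duplicated rows, then drop them
n_dupes = toy.duplicated().sum()
print(n_dupes)  # 1

toy = toy.drop_duplicates().reset_index(drop=True)
print(len(toy))  # 2
```

By default `duplicated` keeps the first occurrence and flags later repeats; pass `subset=` to check only certain columns.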

Have a look at the full code: https://www.kaggle.com/emancs/cleaning-data-in-python