Cleaning Data in Python

Data cleansing is the process of detecting and correcting (or removing) untrustworthy, inaccurate, or outdated records from a data set, archive, table, or database.
You’ll learn techniques for finding and cleaning:
Missing Data
Irregular Data (Outliers)
Unnecessary Data — Repetitive Data, Duplicates and more
We use the California Housing Prices dataset from Kaggle.
About this file
longitude: A measure of how far west a house is; a higher value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house w.r.t ocean/sea
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv("california-housing-prices-housing.csv")
print(df.shape)
print(df.dtypes)
df.head()

# select numeric columns
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)
['longitude' 'latitude' 'housing_median_age' 'total_rooms'
'total_bedrooms' 'population' 'households' 'median_income'
'median_house_value']
# select non numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)
['ocean_proximity']
Missing Data Heatmap
When the number of features is small, we can visualize the missing data with a heatmap.
# specify the colours: yellow is missing, blue is not missing
colours = ['#000099', '#ffff00']
sns.heatmap(df.isnull(), cmap=sns.color_palette(colours))

The total_bedrooms feature has only a small number of missing values, which show up near the end of the data set (around row 20640).
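The per-column counts below can be reproduced by summing the null flags for each column:
# count missing values in each column
print(df.isnull().sum())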
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
We fill the missing values in total_bedrooms with the mean.
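A minimal sketch of this step with standard pandas calls:
# fill missing total_bedrooms values with the column mean
bedrooms_mean = df['total_bedrooms'].mean()
df['total_bedrooms'] = df['total_bedrooms'].fillna(bedrooms_mean)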
Irregular data (Outliers)
Outliers are data points that are distinctly different from the other observations. They can be genuine extreme values or mistakes.
# box plot.
df.boxplot(column=['median_house_value'])
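The summary statistics below come from describe() on the same column:
# summary statistics for median_house_value
print(df['median_house_value'].describe())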

count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000
Name: median_house_value, dtype: float64
The 500001 value is an outlier.
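The count of 965 shown below can be reproduced by counting the rows at that value:
# number of rows where median_house_value equals 500001
print((df['median_house_value'] == 500001).sum())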
965
print((965/ 20640)*100)
4.675387596899225
About 4.7% of the rows have the value 500001. We could remove them, but this percentage is large enough that dropping them would affect the analysis, so we keep them.
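If we did decide to drop those capped rows, a minimal sketch would be:
# optional: keep only rows below the capped value
df_uncapped = df[df['median_house_value'] < 500001]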
Unnecessary Data — Repetitive Data, Duplicates and more
We remove the households column.
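A minimal sketch of dropping the column and any exact duplicate rows:
# drop the households column
df = df.drop(columns=['households'])
# drop exact duplicate rows, if any
df = df.drop_duplicates()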
Have a look at the full code: https://www.kaggle.com/emancs/cleaning-data-in-python