Cleaning Data in Python

Eman Mahmoud

Data cleansing is the process of detecting and correcting (or deleting) untrustworthy, inaccurate, or outdated information in a data set, archive, table, or database.

You’ll learn techniques on how to find and clean:

  • Missing Data

  • Irregular Data (Outliers)

  • Unnecessary Data — Repetitive Data, Duplicates and more

We use the California Housing Prices dataset from Kaggle.

About this file

  1. longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west

  2. latitude: A measure of how far north a house is; a higher value is farther north

  3. housingMedianAge: Median age of a house within a block; a lower number is a newer building

  4. totalRooms: Total number of rooms within a block

  5. totalBedrooms: Total number of bedrooms within a block

  6. population: Total number of people residing within a block

  7. households: Total number of households, a group of people residing within a home unit, for a block

  8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

  9. medianHouseValue: Median house value for households within a block (measured in US Dollars)

  10. oceanProximity: Location of the house w.r.t ocean/sea


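# import packages
import pandas as pd
import numpy as np
import seaborn as sns  # used for the missing-data heatmap below

# load the data and take a first look
df = pd.read_csv("california-housing-prices-housing.csv")
print(df.shape)
print(df.dtypes)
df.head()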


# select numeric columns
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print(numeric_cols)
['longitude' 'latitude' 'housing_median_age' 'total_rooms'
 'total_bedrooms' 'population' 'households' 'median_income'
 'median_house_value']

# select non-numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print(non_numeric_cols)
['ocean_proximity']

Missing Data Heatmap

When the dataset has only a small number of features, we can visualize the missing data with a heatmap.

# specify the colours: yellow is missing, blue is not missing
colours = ['#000099', '#ffff00']
sns.heatmap(df.isnull(), cmap=sns.color_palette(colours))

Only the total_bedrooms feature has missing values, and only a few of them relative to the 20640 rows in the dataset.

print(df.isnull().sum())
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
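Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get the percentage of missing values per column; the small `toy` frame is a hypothetical stand-in for the housing data, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (assumption).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isnull() returns a boolean frame; the mean of booleans is the
# fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Applied to the counts above, 207 missing values out of 20640 rows is roughly 1% of total_bedrooms.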

Fill the missing values in total_bedrooms with the column mean.

df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())
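The mean is sensitive to extreme values, so the median is a common alternative for imputation. The following sketch compares the two on a hypothetical toy column; it is an illustration of the alternative, not the method used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one very large value.
toy = pd.DataFrame({"total_bedrooms": [100.0, 120.0, np.nan, 140.0, 5000.0]})

# Mean imputation, as done in the article.
mean_filled = toy["total_bedrooms"].fillna(toy["total_bedrooms"].mean())

# Median imputation (swapped-in alternative): less affected by the 5000.
median_filled = toy["total_bedrooms"].fillna(toy["total_bedrooms"].median())

print(mean_filled.iloc[2], median_filled.iloc[2])
```

Here the single large value pulls the mean fill far above the typical values, while the median fill stays close to them.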

Irregular data (Outliers)

Outliers are data points that are distinctly different from the other observations. They could be genuine extreme values or mistakes.

# box plot.
df.boxplot(column=['median_house_value'])


df['median_house_value'].describe()
count     20640.000000
mean     206855.816909
std      115395.615874
min       14999.000000
25%      119600.000000
50%      179700.000000
75%      264725.000000
max      500001.000000
Name: median_house_value, dtype: float64

The maximum value, 500001, stands out as an outlier. Let us count how many rows have it.
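As a rough check on that reading, the 1.5 × IQR rule (the default whisker rule in matplotlib box plots) can be applied to the quartiles printed by describe(). This is a sketch that uses those printed numbers rather than the data itself; the variable names are chosen only for illustration.

```python
# Quartiles and maximum copied from the describe() output above.
q1 = 119600.0
q3 = 264725.0
max_value = 500001.0

# Interquartile range and the usual 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(lower_fence, upper_fence)
# True means the maximum lies beyond the upper whisker.
print(max_value > upper_fence)
```

The upper fence comes out at 482412.5, so 500001 lies above it and is drawn as an outlier point in the box plot.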

df[df['median_house_value'] == 500001]['median_house_value'].count()
965
print((965/ 20640)*100)
4.675387596899225

About 4.7% of the rows have the value 500001. We could remove them, but I consider this percentage too large to drop without affecting the analysis.
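Either choice can be expressed with a boolean mask. The sketch below, on a hypothetical toy frame, computes the share of rows at the maximum value and shows two options: dropping those rows, or keeping them and adding a flag column. The column name `at_max_value` is an assumption made for this example.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data (assumption).
toy = pd.DataFrame(
    {"median_house_value": [452600.0, 500001.0, 179700.0, 119600.0]}
)

max_value = 500001  # the value investigated above
at_max = toy["median_house_value"] == max_value

# Share of rows at the maximum value, in percent.
share_pct = at_max.mean() * 100

# Option 1: drop the rows (work on a copy so the original is untouched).
without_max = toy[~at_max].copy()

# Option 2: keep the rows and mark them with a hypothetical flag column.
flagged = toy.assign(at_max_value=at_max)

print(share_pct, len(without_max), int(flagged["at_max_value"].sum()))
```

Flagging keeps every row available for the analysis while still letting later steps treat the marked rows differently.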

Unnecessary Data: Repetitive Data, Duplicates and more

Remove the households column.

df = df.drop(columns=['households'])
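Duplicate rows are another kind of unnecessary data. A minimal sketch of how they could be counted and removed with pandas is shown below; the `toy` frame is hypothetical, and whether the housing data contains any duplicates is not checked here.

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the first.
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.23],
    "latitude": [37.88, 37.86, 37.88],
    "population": [322.0, 2401.0, 322.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```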
