top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's pictureMariam Ahmed

Cleaning Data in Python



Cleaning The Dataset

Data cleaning is important because, if data is wrong or contains missing values, outcomes and algorithms will be unreliable.

So, we must clean it first before doing any process on the data.

# importing pandas to build a data frame from the csv file
import pandas as pd

# Importing the dataset as csv file
myData = pd.read_csv('companies.csv')

# Display the first ten rows of dataset 
myData.head(10)

1. Detecting Null Values

# Display null values
myData.isna()

# Display the total number of the null values in every column
myData.isna().sum()
ID           0
Name         0
Industry     2
Inception    1
Employees    2
State        4
City         0
Revenue      2
Expenses     3
Profit       2
Growth       1
dtype: int64
myData.isna().sum().sum()
17

2.1. Filling Null Values

# Filling all null values with a static number such as 0
demo1 = myData.fillna(value=0)
demo1.head(10)

# Filling each null value with its previous value 
demo2 = myData.fillna(method='pad')
demo2.head(10)

# Filling each null value with its following value 
demo3 = myData.fillna(method= 'bfill')
demo3.head(10)

# Filling numeric values with the mean value in a specific column
mn = myData["Employees"].mean()
myData['Employees'].fillna(mn, inplace=True)
myData.head(10)

myData = pd.read_csv('companies.csv')

# Filling numeric values with the median value in a specific column
med = myData["Employees"].median()
myData["Employees"].fillna(med, inplace = True)
myData.head(10)

myData = pd.read_csv('companies.csv')

# Filling numeric values with the mode value in a specific column
mod = myData["Employees"].mode()[0]
myData["Employees"].fillna(mod, inplace = True)
myData.head(10)

2.2. Drop All Null Values Instead Of Filling It


myData = pd.read_csv('companies.csv')

# Drop all null values
data = myData.dropna()
data.head(10)

# Display the null values detection in each column
data.isna().sum()
ID           0
Name         0
Industry     0
Inception    0
Employees    0
State        0
City         0
Revenue      0
Expenses     0
Profit       0
Growth       0
dtype: int64


3. Wrong Formatted Data


Note: null values can be considered as wrong formatted data That we can easily fill or drop from the dataset as previous.


We can solve the wrong formatted data by converting the column into the same data format. In our dataset, we have a float number in the 'Employees' column and its third-row (after dropping nulls) has 10.5 employees which is not logical. So, we convert it to integer format to solve this logical error.


data['Employees'] = data['Employees'].astype(int)

# The 'Inception' year also must be integer, so we convert it also.
data['Inception'] = data['Inception'].astype(int)
data.head(10)

4. Wrong Data


It needn't be null values or wrong formatted data, it can be wrong like if someone writes "20000" instead of "2000" in the Inception year, although we are still in the 2022 year.

if the data set is small, you can look at it and change the wrong data one by one, but the huge ones need some sophisticated code blocks to detect and correct it, or you can drop it. In our dataset, we note that the fifth and the ninth rows have wrong data in the 'Inception' column so, we need to correct it.

data.loc[4, 'Inception'] = 2013
data.loc[8, 'Inception'] = 2009
data.head(10)

5. Removing Duplicates


Duplicated rows are more than one row that has the same data. That can affect the processing of the data badly.

We need to check if there are any duplicates or not using duplicated() that return 'True' if the row is duplicated and 'False' otherwise.

data.duplicated()
0      False
1      False
3      False
4      False
5      False
       ...  
496    False
497    False
498    False
499    False
500     True
Length: 489, dtype: bool
data.loc[4, 'Inception'] = 2013
data.loc[8, 'Inception'] = 2009
data.head(10)

We note that the row with ID 500 is duplicated. So, we need to drop the duplicated row

data.drop_duplicates(inplace = True)
data.duplicated()
0      False
1      False
3      False
4      False
5      False
       ...  
495    False
496    False
497    False
498    False
499    False
Length: 488, dtype: bool

So, finally, we can say that we clean the dataset and now we can move to the next step.


That's it, I hope this article was worth reading and helped you acquire new knowledge no matter how small.


Feel free to check up on the notebook. You can find the results of code samples in this post.

0 comments

Recent Posts

See All

Comentários


bottom of page