How to Import and Clean Data with Python

Fatma Ali
Dec 8, 2021

Importing Data

Loading and Saving CSVs:

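When your data is in a CSV file, you can load it into a pandas DataFrame with pd.read_csv():

import pandas as pd

df = pd.read_csv('IMDB_Movies.csv')  # load the CSV file into a DataFrame
df  # display the DataFrame (in a notebook)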
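The heading above also mentions saving. As a minimal sketch of a save-and-reload round trip, the example below uses a small in-memory table instead of a real file; the column names movie_title and imdb_score and the sample rows are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real CSV file on disk.
csv_text = (
    "movie_title,director_name,imdb_score\n"
    "Avatar,James Cameron,7.9\n"
    "Spectre,Sam Mendes,6.8\n"
)

# read_csv accepts a file path or any file-like object.
movies = pd.read_csv(io.StringIO(csv_text))

# Save to CSV; index=False keeps the row index out of the output.
buffer = io.StringIO()
movies.to_csv(buffer, index=False)

# Reload what was just written and check that nothing changed.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.equals(movies))
```

With a real file you would pass a path such as 'cleaned_movies.csv' (a made-up name) to .to_csv() instead of the buffer.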

Cleaning Data

Diagnose the Data:

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data? For data to be tidy, it must have:

Each variable as a separate column
Each row as a separate observation
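To make the two rules concrete, here is a small sketch of turning an untidy table into a tidy one with .melt(). The table, the director labels and the movies_released column are invented for the example and are not part of the IMDB dataset.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is spread across column names.
wide = pd.DataFrame({
    "director_name": ["Director A", "Director B"],
    "2019": [2, 1],
    "2020": [3, 0],
})

# melt moves the year columns into rows, so each variable gets its own
# column and each row is a single (director, year) observation.
tidy = wide.melt(
    id_vars="director_name",
    var_name="year",
    value_name="movies_released",
)
print(tidy)
```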

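df.info() prints a concise summary of the DataFrame: each column's name, its count of non-null values and its dtype, plus the overall memory usage. This makes it a quick way to spot missing values and unexpected types.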

df.info()
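For actual summary statistics, .describe() is the usual companion to .info(). The sketch below runs both on a tiny made-up table; the rows and the imdb_score column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with one missing value in each column.
sample = pd.DataFrame({
    "director_name": ["James Cameron", None, "Sam Mendes"],
    "imdb_score": [7.9, 6.8, None],
})

# Structure: column names, non-null counts, dtypes, memory usage.
sample.info()

# Statistics: count, mean, std, min, quartiles and max.
# By default only numeric columns are included.
summary = sample.describe()
print(summary)
```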

Dealing with Duplicates:

Often we see duplicated rows of data in the DataFrames we are working with. This could happen due to errors in data collection or in saving and loading the data. To check for duplicates, we can use the DataFrame method .duplicated(), which returns a Boolean Series that is True for every row that repeats an earlier row.

df.duplicated()
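The following sketch shows what .duplicated() returns on a tiny hand-made table (the rows are invented for illustration). By default the first occurrence is not flagged, only the later repeats; keep=False flags every copy.

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first.
sample = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Avatar"],
    "director_name": ["James Cameron", "Sam Mendes", "James Cameron"],
})

# Default keep="first": only the later copy is marked as a duplicate.
is_dup = sample.duplicated()
print(is_dup.tolist())  # [False, False, True]

# Summing the Boolean Series counts the duplicate rows.
n_duplicates = int(is_dup.sum())
print(n_duplicates)  # 1

# keep=False marks every row that has a copy, which helps when inspecting them.
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)
```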

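We can use the .drop_duplicates() method to remove all rows that are duplicates of another row. It returns a new DataFrame rather than changing df in place, so we assign the result back. To compare only certain columns, pass them with subset, for example subset=['director_name'], but note that this keeps just one row per director.

df = df.drop_duplicates()
df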
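Two details of .drop_duplicates() are easy to miss: it returns a new DataFrame instead of modifying the original, and subset changes what counts as a duplicate. The sketch below illustrates both on invented rows.

```python
import pandas as pd

# Hypothetical sample: one exact repeat, plus a second film by the same director.
sample = pd.DataFrame({
    "movie_title": ["Avatar", "Titanic", "Avatar"],
    "director_name": ["James Cameron", "James Cameron", "James Cameron"],
})

# Compare all columns: only the exact repeat of the first row is dropped.
full = sample.drop_duplicates()

# Compare director_name only: just the first row per director survives.
by_director = sample.drop_duplicates(subset=["director_name"])

# The original is untouched because both calls returned new DataFrames.
print(len(sample), len(full), len(by_director))  # 3 2 1
```

Unless one row per director is really what you want, leaving subset out is the safer default.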

Missing Values:

We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as NaN (Not a Number) values. To count the missing values in each column, we can chain .isnull() and .sum():

df.isnull().sum()

If we wanted to remove every row with a NaN value in the director_name column only, we could specify a subset:

df = df.dropna(subset=['director_name'])
df
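Here is a small end-to-end sketch of the steps above on made-up rows. The final fillna step is an alternative to dropping rows that the text does not cover; it is included only as an assumption about what you might want for a numeric column.

```python
import pandas as pd

# Hypothetical sample with one missing value in each column.
sample = pd.DataFrame({
    "director_name": ["James Cameron", None, "Sam Mendes"],
    "imdb_score": [7.9, 6.8, None],
})

# Count missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop only the rows where director_name is missing.
cleaned = sample.dropna(subset=["director_name"])

# Alternative (not from the article): fill missing scores with the column mean.
filled = cleaned.fillna({"imdb_score": cleaned["imdb_score"].mean()})
print(filled)
```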

Looking at Types:

Each column of a DataFrame holds items of a single data type, or dtype. The most common dtypes in pandas are float, int, bool, datetime, timedelta, category and object (the last is typically used for strings and mixed values). Often, we want to convert between types so that we can do better analysis.

To see the types of each column of a DataFrame, we can use:

print(df.dtypes)
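Converting between types could look like the sketch below. The title_year and color columns and their values are assumptions chosen for illustration; pd.to_numeric and .astype are the standard pandas tools for this kind of conversion.

```python
import pandas as pd

# Hypothetical sample: years stored as text, plus a low-cardinality text column.
sample = pd.DataFrame({
    "title_year": ["2009", "2015", "unknown"],
    "color": ["Color", "Color", "Black and White"],
})

# errors="coerce" turns values that cannot be parsed into NaN,
# so the column becomes float rather than raising an error.
sample["title_year"] = pd.to_numeric(sample["title_year"], errors="coerce")

# category suits columns with only a few distinct values.
sample["color"] = sample["color"].astype("category")

print(sample.dtypes)
```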

You can check the full code here
