

How to Import and Clean Data with Python

Importing Data

Loading and Saving CSVs:

When you have data in a CSV file, you can load it into a pandas DataFrame using pd.read_csv():

import pandas as pd
df = pd.read_csv('IMDB_Movies.csv')
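Saving works the other way round. The sketch below is only an illustration: the small DataFrame stands in for the movie data, and the file name movies_copy.csv and the column names are assumptions, not part of the original dataset description.

```python
import pandas as pd

# Hypothetical stand-in for the movie data (column names are assumptions)
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "director_name": ["James Cameron", "Sam Mendes"],
})

# Save the DataFrame to a CSV file; index=False leaves the row index out of the file
df.to_csv("movies_copy.csv", index=False)

# Load the file again to confirm that the data survived the round trip
df_loaded = pd.read_csv("movies_copy.csv")
print(df_loaded)
```

Without index=False, pandas also writes the row index as an extra unnamed column, which then shows up again when the file is reloaded.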

Cleaning Data

Diagnose the Data:

We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data? For data to be tidy, it must have:

  • Each variable as a separate column

  • Each row as a separate observation

A quick summary of the DataFrame gives some statistics for each column.
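A first look at a DataFrame could be sketched as follows. This assumes that .head() and .describe() are the tools used for the diagnosis; the sample data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data (column names are assumptions)
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Skyfall"],
    "imdb_score": [7.9, 6.8, 7.8],
})

# Show the first rows to check that each column is one variable
# and each row is one observation
print(df.head())

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = df.describe()
print(summary)
```

By default, .describe() only summarizes numeric columns, so movie_title does not appear in the result here.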

Dealing with Duplicates:

Often we see duplicated rows in the DataFrames we are working with. This can happen because of errors in data collection or in saving and loading the data. To check for duplicates, we can use the pandas method .duplicated(), which returns a boolean Series that is True for every row that repeats an earlier row.
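A small sketch of the duplicate check, using hypothetical sample data in which the third row repeats the first:

```python
import pandas as pd

# Hypothetical sample data: the third row is an exact copy of the first
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Avatar"],
    "director_name": ["James Cameron", "Sam Mendes", "James Cameron"],
})

# True marks a row that repeats an earlier row
is_duplicate = df.duplicated()
print(is_duplicate)

# Summing the boolean Series counts the duplicate rows
num_duplicates = is_duplicate.sum()
print(num_duplicates)
```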


We can use the pandas .drop_duplicates() method to remove all rows that are duplicates of another row. By default, the first occurrence of each row is kept.
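As a minimal sketch, again with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data: the third row is an exact copy of the first
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Avatar"],
    "director_name": ["James Cameron", "Sam Mendes", "James Cameron"],
})

# Drop the repeated rows; the first occurrence of each row is kept
deduped = df.drop_duplicates()

# Optional: renumber the rows so the index has no gaps
deduped = deduped.reset_index(drop=True)
print(deduped)
```

.drop_duplicates() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a variable.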


Missing Values:

We often have data with missing elements, as a result of a problem with the data collection process or errors in the way the data was stored. The missing elements normally show up as NaN (Not a Number) values.
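Before removing anything, it helps to know how many values are missing. One possible way to count them, assuming .isna() is used for the check (the sample data and most column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Skyfall"],
    "director_name": ["James Cameron", np.nan, "Sam Mendes"],
    "imdb_score": [7.9, 6.8, np.nan],
})

# .isna() marks each missing cell with True; .sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```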


If we wanted to remove every row with a NaN value in the director_name column only, we could specify a subset:

df = df.dropna(subset=['director_name'])
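To see what the subset argument changes, the sketch below compares it with a plain .dropna() call. The sample data is hypothetical; only the director_name column comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row lacks a director, another lacks a score
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Skyfall"],
    "director_name": ["James Cameron", np.nan, "Sam Mendes"],
    "imdb_score": [7.9, 6.8, np.nan],
})

# Without arguments, every row that has a NaN in any column is removed
rows_without_any_nan = df.dropna()

# With subset, only NaN values in director_name cause a row to be removed
rows_with_director = df.dropna(subset=["director_name"])

print(len(rows_without_any_nan), len(rows_with_director))
```

The subset version keeps the row with the missing score, which matters when that column is not needed for the analysis at hand.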

Looking at Types:

Each column of a DataFrame holds items of a single data type, or dtype. The dtypes that pandas uses are: float, int, bool, datetime, timedelta, category and object. Often, we want to convert between types so that we can do better analysis, for example turning numbers stored as text into integers so that we can calculate with them.

To see the types of each column of a DataFrame, we can use:
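A minimal sketch of checking and converting types, assuming the .dtypes attribute and the .astype() method are the intended tools. The sample data and the title_year column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: title_year is stored as text instead of numbers
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "title_year": ["2009", "2015"],
    "imdb_score": [7.9, 6.8],
})

# Show the dtype of every column
print(df.dtypes)

# Convert the year column from text to integers so it can be used in calculations
df["title_year"] = df["title_year"].astype(int)
print(df.dtypes)
```

.astype(int) raises an error if a value cannot be read as an integer, so missing or malformed entries need to be handled first.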



