top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Writer's picturePrativa Bhatta

Importing, Cleaning and Visualizing Data in Python

Examples of the Data Files;

Excel file, CSV file, Text file, HDF5 file, MATLAB files, RDBMS, etc.

Using these different files you import them differently,

One popular option for importing data is to use the pandas,

especially for a huge dataset. I have already discussed some pandas functions here

In that regard, before you can import the dataset, we also need to import pandas

# Importing needed library
import pandas as pd

Now, we can load the data needed.

Here we will discuss the common and basic data files i.e. Excel file and CSV file.

Excel File

Importing an excel file, the common way normally used is pd.read_excel()

pd.read_excel() : loads an excel file with the extensions; xls, xlsx, etc. into a DataFrame for further analysis using python

STEPS TO LOAD

> Copy the excel file path.

Suppose this is the file path that includes the file name and the file extension

"C:\Users\HP\Desktop\data analysis sheet.xlsx"

> Apply the code.

df = pd.read_excel(r"C:\Users\HP\Desktop\data analysis sheet.xlsx")

df

The r before the file path string .... r"c because of the special character \ in the path.

or,

df = df.read_excel("r"C:\Users\HP\Desktop\data analysis sheet.xlsx")

> Run the codes

Run the code by calling if

After this code has run, it loads the excel file into the pandas DataFrame.

 df.head()
 

CSV file

Another data file is a CSV file (comma-separated values) which is a delimited text file that uses a comma to separate values.

You can import a csv by calling

pd.read_csv()

pd.read_csv() : This loads a csv file into a DataFrame.

It follows the same steps as loading an excel file but here the file extension differ

Looks like this,

data =pd.read_csv(r"C:\Users\HP\Desktop\data analysis sheet.csv")

data.head()

Note that with the excel file you use pd.read_excel and with the CSV file you used pd.read_csv

Now, we know some basics of importing data in Python.

There are other ways you can also import data once that data is not on your laptop like web scraping using HTML, which I am not going to dive deeper into. But we extract from the websites by importing BeautifulSoup and request.

import request
from bs4 import BeautifulSoup
url = '' example "
r = request.get(url)
html.doc = r.text
Soup = BeautifulSoup(html.doc)
Pretty_Soup = Soup.pretty()
print(pretty_Soup)

Now import some basic files.


2. DATA CLEANING

What is Data Cleaning?

From Wikipedia,

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.


After the data has been imported, you can now explore the data. First and foremost you need to clean the data.

Why do you need to clean the data?

You need to clean the data because you can not make meaningful insights from messy and dirty data. As the saying goes - "Garbage In, Garbage Out".

Using messy and dirty data for analysis, predictions, and making inferences will not cause only wrong outcomes but also bad decisions. Hence the need to clean data.


How to clean the data

Using a dataset from the Kwame Nkrumah University of Science and Technology. About the less endowed students(High School students in the deprived communities).

You can clean a data by

1. Checking for Missing Values

2. Dropping Irrelevant Features

3. etc.

You need to import the needed libraries;

import pandas as pd
import numpy as np


Load the necessary dataset. We have already discussed how to;

data =pd.read_csv(r"C:\Users\HP\Desktop\data analysis sheet.csv")

data.head()

Running head() gives the top five from the dataset.

Checking for Missing Values

Missing Values are values that were not recorded in a dataset. They can be a single missing value in a cell or missing values in the entire observation (rows or columns).

To know the distinct items within a column per count use value_counts

df['data'].value_counts()
Dropping Missing Values

But in an instance where there are missing you can deal with them by dropping the entire row or column.

Dropping column

df.drop['example', axis = 1]

Dropping row

df.drop['example', axis = 0]

This drops the entire column or row with the name example. With this approach, you lose some essential information in the data that can be useful.

Hence not a best practice.

Replacing Missing Values

On the other hand, you can input values to replace these null values. Replacing helps you to overwrite the data.

Example:

df['example'].replace('x', 'y', inplace = True)

With inplace = True, any change is done permanently. To avoid this and have the original dataset, you can assign a variable name

df.['example'] = df.['example'].replace('x', 'y')

Replacing NAN values

You will use the python code df.isnull().sum() to check for the "null" values in the dataset.

From the dataset, we are using;

knust.isnull().sum()

when we have null values and you want fill them 
df['example'] = df[;example'].fillna(' ')

You need not to replace all the null values at once. You can replace them with the mean, median, the highest occurring value, or the lowest occurring value.

3. Data Visualization

Clean and well-structured data is ready for analysis. One best way to make meaningful insight from data is to visualize it.

**Good visualization allows you to communicate your data to other consumers for easy understanding.

Before you Visualize, you need to import the necessary libraries. Commonly used libraries are the matplotlib and seaborn but there are others that you can explore

import matplotlib.pyplot as plt
%matplotlib notebook
import seaborn as sns

Matplotlib

This is one popular library in python for plotting. It is low leveled as compared to seaborn. Before you can use it you need to install it i.e.

pip install matplotlib

Matplotlib is essentially used for creating basic and popular graphs which include the bar chart, histogram, line chart, and others.


Bar Chart; To plot a bar chart, you use

df.plot(kind='bar')
plot.bar()
Now you can import, clean and visualize data!!
0 comments

Recent Posts

See All

header.all-comments


bottom of page