top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

IMPORTING AND CLEANING DATA IN PYTHON

Importing Data In Python

We need datasets for data analysis. Python has various modules which help us in importing the external data in various file formats to a python program. There are various file types for data which are text files, SAS files, Stata File, Excel Spreadsheets, relational databases, pickled files, Matlab files, HDF5 files, and others. We will be going through how to import some of these various data file types for analysis.

Importing Text Files In Python

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. Below is a code to show how it works using google collaboratory:

#Uploading file from local computer to google collab
from google.colab import files
uploaded = files.upload()

#Opening file to read first 3 lines
with open('Pitch.txt', 'r') as file:
  print(file.readline()) 
  print(file.readline())
  print(file.readline())

Output:

In your pitch;  
1.We want you to clearly define your problem  
2.We want you to clearly define your solution

Importing SAS Files In Python

A file with the SAS file extension is a Statistical Analysis Software (SAS) file. The SAS file is an ASCII (text) file that contains a series of SAS functions that may be run against a data set, or a SAS file may contain the actual data set.

#Uploading file from local to google collab
from google.colab import files
sas_file = files.upload()

#Downloading the sas7bdat library
!pip install sas7bdat

from sas7bdat import SAS7BDAT
with SAS7BDAT('eventrepository.sas7bdat') as file:
  df_sas = file.to_data_frame()

df_sas.head()

Output:


Importing Excel Spreadsheets In Python

A spreadsheet is a file that exists of cells in rows and columns and can help arrange, calculate and sort data. Data in a spreadsheet can be numeric values, as well as text, formulas, references, and functions.

#Uploading file from local computer to google collab
from google.colab import files
xlsx = files.upload()

#importing pandas
import pandas as pd

#reading the excel worksheet
excel_file = pd.read_excel('indicadores3.xlsx')
excel_file.head()

Output:


Importing CSV Files In Python

A comma-separated values file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

#Uploading file from local computer to google collab
from google.colab import files
csv_file = files.upload()

#Importing the pandas library
import pandas as pd

#Reading the csv file user.csv
csv_data = pd.read_csv('user.csv')
csv_data.head()

Output:


Cleaning Data In Python

Data cleaning or cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. The issues with data usually include missing data and irregular data (outliers).


Missing Data

Dealing with missing data/values is one of the most tricky but common parts of data cleaning. While many models can live with other problems with the data, most models don’t accept missing data.

One technique for checking for missing data is using the isnull function.

#Uploading file from local computer to google collab
from google.colab import files
csv_file = files.upload()

#Importing the pandas library
import pandas as pd

#Reading the csv file user.csv
csv_data = pd.read_csv('user.csv')
csv_data.head()

#Checking for null values or missing values using the isnull function
csv_data.isnull().sum()

Output:

u_id            0 
name            0 
mobile          0 
wallet_id       0 
when_created    0 
dtype: int64

One other technique for checking for missing data is using the heatmap

import seaborn as sns
cols = csv_data.columns[:6] # first 30 columns
colours = ['#000099', '#ffff00'] # specify the colours - yellow is missing. blue is not missing.
sns.heatmap(csv_data[cols].isnull(),cmap=sns.color_palette(colours))

Output:


Our data is free from missing values. All observations are captured with values.

Regardless, the problem of missing data can be solved by dropping the observations, dropping the features, or imputing missing values.

Dropping the observations is as follows:

#Dropping missing values from the dataset
csv_data_no_missing_row_values = csv_data.dropna()
#Displaying info after dropping missing values.
csv_data_no_missing_row_values.info()

Output:

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100 entries, 0 to 99 
Data columns (total 5 columns): 
#   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
0   u_id          100 non-null    int64  
1   name          100 non-null    object 
2   mobile        100 non-null    int64  
3   wallet_id     100 non-null    int64  
4   when_created  100 non-null    object 
dtypes: int64(3), object(2) 
memory usage: 4.7+ KB

Dropping the features is as follows:

# Assuming mobile column has a lot of missing.
# If we want to drop.
cols_to_drop = ['mobile']
csv_data_without_mobile_column = csv_data.drop(cols_to_drop, axis=1)
csv_data_without_mobile_column.head()

Output:


Imputing missing values:

#Calculating the median value
med = csv_data['mobile'].median()

#Printing the median value to console
print(med)

#Filling all missing values with the median value.
csv_data['mobile'] = csv_data['mobile'].fillna(med)

Output:

550000049.5

Irregular data (Outliers)

Outliers are data that is distinctively different from other observations. They could be real outliers or mistakes. Techniques for finding outliers are: histogram/boxplot or descriptive statistics.

The histogram technique is used when the feature is numeric. The technique is as follows:

#Uploading file from local computer to google collab
from google.colab import files
csv_file = files.upload()

#Importing the pandas library
import pandas as pd

#Reading the csv file user.csv
csv_data = pd.read_csv('user.csv')
csv_data.head()

# histogram of mobile.
csv_data['mobile'].hist(bins=100)

Output:


The boxplot technique is used when the feature is numeric. The technique is as follows:

#Uploading file from local computer to google collab
from google.colab import files
csv_file = files.upload()

#Importing the pandas library
import pandas as pd

#Reading the csv file user.csv
csv_data = pd.read_csv('user.csv')
csv_data.head()

# box plot.
csv_data.boxplot(column=['mobile'])

Output:


The descriptive statistic technique is also used for numeric features. The technique is as follows:

#Uploading file from local computer to google collab
from google.colab import files
csv_file = files.upload()

#Importing the pandas library
import pandas as pd

#Reading the csv file user.csv
csv_data = pd.read_csv('user.csv')
csv_data.head()

# box plot.
csv_data.boxplot(column=['mobile'])

Output:

count    1.000000e+02 
mean     5.500000e+08 
std      2.901149e+01 
min      5.500000e+08 
25%      5.500000e+08 
50%      5.500000e+08 
75%      5.500001e+08 
max      5.500001e+08 
Name: mobile, dtype: float64

The above methods show that the data being analyzed is free from irregular data/outliers.

Regardless, while outliers are not hard to detect, we have to determine the right solutions to handle them. It highly depends on the dataset and the goal of the project. The methods of handling outliers are somewhat similar to missing data. We either drop or adjust or keep them.



0 comments

Recent Posts

See All
bottom of page