IMPORTING, CLEANING AND VISUALIZING DATA
Importing, cleaning, and visualizing data are core stages of a data science project workflow because they determine the quality of the final tool or analysis. Every data science project should therefore work through these three processes carefully as a form of quality assurance.
A mistake in any one of these stages can carry over into the others and cause serious problems. Care is therefore taken to ensure that a dataset is imported correctly, cleaned precisely, and then visualized so that it tells a clear story.
Importing Data in Python
Working with data is a core task, and Python has the right tools and modules for it. Python is well suited to working with datasets because it can import external data in many formats, such as CSV, Excel, HDF5, and SAS files, as well as data from databases. Although Python has a built-in function called open() that lets us work with files, the pandas data science library is even more powerful for working with datasets, as it helps us import and manipulate data in various formats efficiently. This is done by importing pandas and loading the dataset with the matching pandas reader function, e.g. pd.read_csv('file') or pd.read_excel('file'). Below is a code example to help you understand better.
# First we import pandas
import pandas as pd
# Since we are working in Google Colab and not Jupyter Notebook, we need to upload our dataset from our machine into Colab
from google.colab import files
# files.upload() opens a file picker and copies the chosen file into the Colab session
uploaded = files.upload()
# The file uploaded above is a CSV file; we load it with pandas to get a DataFrame
df = pd.read_csv('students_adaptability_level_online_education.csv')
# We can preview the first rows of the dataset using the head method
df.head()
# We upload an Excel file from the machine into the Google Colab environment in the same way
uploaded = files.upload()
df2 = pd.read_excel('CAS Chem Name SMILES Data Set CID.5.9.22.xlsx')
df2.head()
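To make the difference between open() and pandas more concrete, here is a minimal sketch that runs outside Colab as well. The file name sample_students.csv and its three rows are hypothetical data invented for illustration; they are not taken from the datasets above. The sketch writes a small CSV to a temporary folder, reads it once with open() and once with pd.read_csv, and prints both results.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical sample rows, used only so this sketch is self-contained
rows = [
    {"name": "Ada", "age": 21, "level": "High"},
    {"name": "Ben", "age": 19, "level": "Low"},
    {"name": "Cy", "age": 23, "level": "Moderate"},
]

# Write the rows to a temporary CSV file (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample_students.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "level"])
    writer.writeheader()
    writer.writerows(rows)

# Built-in open(): we get raw text lines and would have to parse them ourselves
with open(csv_path) as f:
    raw_lines = f.read().splitlines()

# pandas: a single call parses the header, splits the columns, and infers the types
df_sample = pd.read_csv(csv_path)

print(raw_lines[0])      # the header row as one plain string
print(df_sample.head())  # the same data as a DataFrame
print(df_sample.dtypes)  # column types inferred by pandas
```

With open() every value arrives as text, so numeric columns such as age would still need to be converted by hand, whereas read_csv returns a DataFrame that is ready for filtering, aggregation, and plotting.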
Cleaning Data in Python
Data cleaning is an important part of a data science project workflow and a skill every data scientist should have. Not only does it aid data visualization, it also prepares data for fitting machine learning and statistical models. Data cleaning is the process of identifying and correcting dirty or inaccurate data. It usually entails modifying, replacing, and sometimes entirely deleting dirty data. Some major data issues in data science include:
Missing data
Redundant and unnecessary data