Means of Importing Data in Python


Introduction

Data comes from many sources and in many types, and Python provides functionality to read and process them for further analysis. Two key things to consider when importing data with pandas are the file format and the file path where the data is stored.

A file format is a standard way in which data is encoded for storage in a file. You can usually identify a file format from the file's extension. For example, a file named “Data” saved in CSV format will appear as “Data.csv”. Python, through its standard library and third-party packages, can read many file formats, including comma-separated values (CSV), XLSX, ZIP, plain text (TXT), JSON, XML, HTML, images, Hierarchical Data Format (HDF), PDF, DOCX, MP3, and MP4. Let us see how we can import some of these formats.
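As a small illustration of identifying a format from the extension, the sketch below uses the standard library's pathlib module. The file names, the mapping, and the helper name guess_format are made up for this example, and the mapping covers only a handful of formats.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a readable format name.
KNOWN_FORMATS = {
    ".csv": "Comma-separated values",
    ".xlsx": "Excel workbook",
    ".txt": "Plain text",
    ".json": "JSON",
}

def guess_format(file_path):
    """Return a format name based on the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    return KNOWN_FORMATS.get(suffix, "Unknown")

print(guess_format("Data.csv"))
print(guess_format("report.XLSX"))
print(guess_format("notes"))
```

An extension is only a hint: a file can be renamed without its contents changing, so the real check is whether the matching reader can parse it.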

1. Importing CSV files

A Comma Separated Values (CSV) file is a plain text file that contains a list of data. These files are often used for exchanging data between different applications.

The procedure to read a CSV file into Python looks like the following.

A. Import pandas

B. Use pd.read_csv("Filepath") as seen in the code snippet below.


#Import Pandas 
import pandas as pd
df_Athletes = pd.read_csv(r"C:\Users\Yosef\OneDrive - cumc.columbia.edu\Desktop\DATA\Athletes.csv")
df_Athletes.head(10)

Output


   Name               NOC                       Discipline
0  AALERUD Katrine    Norway                    Cycling Road
1  ABAD Nestor        Spain                     Artistic Gymnastics
2  ABAGNALE Giovanni  Italy                     Rowing
3  ABALDE Alberto     Spain                     Basketball
4  ABALDE Tamara      Spain                     Basketball
5  ABALO Luc          France                    Handball
6  ABAROA Cesar       Chile                     Rowing
7  ABASS Abobakr      Sudan                     Swimming
8  ABBASALI Hamideh   Islamic Republic of Iran  Karate
9  ABBASOV Islam      Azerbaijan                Wrestling

The output above shows that we successfully imported the athletes dataset, which was originally saved as a CSV file.
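pd.read_csv also accepts optional parameters that control what gets loaded. The sketch below is only an illustration: it reads a few rows copied from the athletes output above out of an in-memory string (via io.StringIO) instead of the original file, and the variable names are made up for the example.

```python
import io
import pandas as pd

# A few rows copied from the athletes output, held in memory
# so the example runs without the original file.
csv_text = """Name,NOC,Discipline
AALERUD Katrine,Norway,Cycling Road
ABAD Nestor,Spain,Artistic Gymnastics
ABAGNALE Giovanni,Italy,Rowing
ABALDE Alberto,Spain,Basketball
"""

# usecols keeps only the listed columns; nrows limits how many data rows are read.
df_sample = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "NOC"], nrows=3)
print(df_sample)
print(df_sample.shape)
```

Limiting columns and rows like this can be handy for a quick look at a large file before loading all of it.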

After importing the data, we should check for null values and handle any we find. Let us use the isna() function to check for null values.


# Count the missing (null) values in each column with isna()
df_Athletes.isna().sum()

Output


Name          0
NOC           0
Discipline    0
dtype: int64

The output indicates that there are no missing values in the athletes dataset, which comprises three columns.
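The athletes dataset has nothing to fix, but when isna() does report missing values, two common ways to handle them are dropping the affected rows or filling the gaps. The sketch below uses a small made-up DataFrame with one deliberately missing Discipline value; the placeholder label "Unknown" is an assumption, not something from the dataset.

```python
import pandas as pd

# Made-up rows with one deliberately missing Discipline value.
df_demo = pd.DataFrame({
    "Name": ["ABAD Nestor", "ABALO Luc", "ABASS Abobakr"],
    "NOC": ["Spain", "France", "Sudan"],
    "Discipline": ["Artistic Gymnastics", None, "Swimming"],
})

missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Option 1: drop rows that contain any missing value.
df_dropped = df_demo.dropna()

# Option 2: keep every row and fill the gap with a placeholder label.
df_filled = df_demo.fillna({"Discipline": "Unknown"})
print(df_filled)
```

Which option is appropriate depends on the analysis: dropping rows loses data, while filling can hide the fact that a value was never recorded.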

Another option is to use Python's built-in csv module instead of pandas, as follows. Python has a built-in open() function to open a file. This function returns a file object, also called a handle, which is used to read or modify the file.



import csv

# newline='' lets the csv module handle line endings itself
with open(r"C:\Users\Yosef\OneDrive - cumc.columbia.edu\Desktop\DATA\Athletes.csv", newline='') as csvfile:
    # The file is comma-separated, so split each line on commas
    csv_data = csv.reader(csvfile, delimiter=',')
    for row in csv_data:
        print(', '.join(row))

Output


Name, NOC, Discipline
AALERUD Katrine, Norway, Cycling Road
ABAD Nestor, Spain, Artistic Gymnastics
ABAGNALE Giovanni, Italy, Rowing
ABALDE Alberto, Spain, Basketball
ABALDE Tamara, Spain, Basketball
ABALO Luc, France, Handball
ABAROA Cesar, Chile, Rowing
ABASS Abobakr, Sudan, Swimming
ABBASALI Hamideh, Islamic Republic of Iran, Karate
ABBASOV Islam, Azerbaijan, Wrestling
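When the first row of a CSV file holds column names, the csv module's DictReader can return each row as a dictionary keyed by those names. This is a minimal sketch that reads two athletes rows from an in-memory string (io.StringIO stands in for the opened file); the variable names are illustrative.

```python
import csv
import io

csv_text = (
    "Name,NOC,Discipline\n"
    "AALERUD Katrine,Norway,Cycling Road\n"
    "ABAD Nestor,Spain,Artistic Gymnastics\n"
)

# io.StringIO stands in for a file opened with open(path, newline='').
with io.StringIO(csv_text) as csvfile:
    reader = csv.DictReader(csvfile)  # the first row supplies the keys
    rows = list(reader)

for row in rows:
    print(row["Name"], "-", row["Discipline"])
```

Accessing fields by name rather than by position keeps the code readable and means it still works if the column order changes.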

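We can also read the raw lines with a plain open() call in Python, as follows.


# The with block closes the file automatically when reading is done
with open(r"C:\Users\Yosef\OneDrive - cumc.columbia.edu\Desktop\DATA\Athletes.csv") as athletes_file:
    print(list(athletes_file))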

Output

['Name,NOC,Discipline\n', 'AALERUD Katrine,Norway,Cycling Road\n', 'ABAD Nestor,Spain,Artistic Gymnastics\n', 'ABAGNALE Giovanni,Italy,Rowing\n', 'ABALDE Alberto,Spain,Basketball\n', 'ABALDE Tamara,Spain,Basketball\n', 'ABALO Luc,France,Handball\n', 'ABAROA Cesar,Chile,Rowing\n', 'ABASS Abobakr,Sudan,Swimming\n', 'ABBASALI Hamideh,Islamic Republic of Iran,Karate\n', 'ABBASOV Islam,Azerbaijan,Wrestling\n', 'ABBINGH Lois,Netherlands,Handball\n', 'ABBOT Emily,Australia,Rhythmic Gymnastics\n', 'ABBOTT Monica,United States of America,Baseball/Softball\n', 'ABDALLA Abubaker Haydar,Qatar,Athletics\n', 'ABDALLA Maryam,Egypt,Artistic Swimming\n', 'ABDALLAH Shahd,Egypt,Artistic Swimming\n', 'ABDALRASOOL Mohamed,Sudan,Judo\n', 'ABDEL LATIF Radwa,Egypt,Shooting\n', 'ABDEL RAZEK Samy,Egypt,Shooting\n', 'ABDELAZIZ Abdalla,Egypt,Karate\n', 'ABDELAZIZ Farah,Egypt,Table Tennis\n', 'ABDELAZIZ Feryal,Egypt,Karate\n', 'ABDELMAWGOUD Mohamed,Egypt,Judo\n', 'ABDELMOTTALEB Diaaeldin Kamal Gouda,Egypt,Wrestling\n', 'ABDELRAHMAN Ihab,Egypt,Athletics\n', 'ABDELSALAM Mohamed,Egypt,Football\n', 'ABDELSALAM Nour,Egypt,Taekwondo\n', 'ABDELWAHED Ahmed,Italy,Athletics\n', 'ABDI Bashir,Belgium,Athletics\n', 'ABDIRAHMAN Abdi,United States of America,Athletics\n', 'ABDUL HADI Farah Ann,Malaysia,Artistic Gymnastics\n', 'ABDUL RAHMAN Kiria Tikanah,Singapore,Fencing\n', 'ABDUL RAZZAQ Fathimath Nabaaha,Maldives,Badminton\n', 'ABDULHAMID Saud,Saudi Arabia,Football\n', 'ABDULJABBAR Ammar Riad,Germany,Boxing\n', 'ABDULLAEV Gulomjon,Uzbekistan,Wrestling\n', 'ABDULLAEV Muminjon,Uzbekistan,Wrestling\n', 'ABDULLAH Rahmat Erwin,Indonesia,Weightlifting\n', 'ABDULLIN Ilfat,Kazakhstan,Archery\n',
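The list printed above still holds raw lines, each ending in a newline character. One way to turn such lines into fields is sketched below, using the first three lines of that output. It assumes no field contains a comma of its own; when that cannot be guaranteed, the csv module is the safer choice because it understands quoted fields.

```python
raw_lines = [
    "Name,NOC,Discipline\n",
    "AALERUD Katrine,Norway,Cycling Road\n",
    "ABAD Nestor,Spain,Artistic Gymnastics\n",
]

# Remove the trailing newline, then split each line on commas.
# The first parsed line becomes the header; the rest are data records.
header, *records = [line.rstrip("\n").split(",") for line in raw_lines]
print(header)
print(records)
```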

2. Importing Excel files

According to microsoft.com, Microsoft Excel is the industry-leading spreadsheet software program, a powerful data visualization and analysis tool used by millions of people worldwide. To work with data stored in an Excel spreadsheet, pandas provides a function that imports the data into a DataFrame for further analysis and use.

The syntax for importing Excel files looks like the following.


# Import Excel files into Python using pandas
import pandas as pd
excel_imported = pd.read_excel(r"C:\Users\Yosef\OneDrive - cumc.columbia.edu\Desktop\COP21\Yosef.xlsx")
excel_imported.head(10)

Output


   age  sex  productid  FollowUpDate  followupdate_et
0  43   F    227839     2021-06-29    22/10/2013
1  45   F    54759      2021-09-25    15/01/2014
2  51   F    1173639    2021-07-12    2013-05-11 00:00:00
3  52   F    4556       2021-06-24    17/10/2013
4  52   M    6537       2021-08-02    26/11/2013
5  21   M    11939      2021-10-07    27/01/2014
6  63   M    1439       2021-10-16    2014-06-02 00:00:00
7  48   M    5306       2021-08-05    29/11/2013
8  67   M    1085       2021-05-15    2013-07-09 00:00:00
9  40   F    8591       2021-06-18    2013-11-10 00:00:00

Once we have the DataFrame, we can use different visuals to check whether the data is complete.

Let us plot a histogram of the respondents' ages.
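Alongside visuals, a few summary calls give a quick numeric check of completeness. The sketch below re-types the first five rows of three columns from the output above into a small DataFrame (named df_excel_demo here purely for illustration) so it can run without the original workbook.

```python
import pandas as pd

# First five rows of the output above, re-typed by hand for illustration.
df_excel_demo = pd.DataFrame({
    "age": [43, 45, 51, 52, 52],
    "sex": ["F", "F", "F", "F", "M"],
    "productid": [227839, 54759, 1173639, 4556, 6537],
})

missing_counts = df_excel_demo.isna().sum()       # missing values per column
age_summary = df_excel_demo["age"].describe()     # count, mean, min, max, ...
sex_counts = df_excel_demo["sex"].value_counts()  # frequency of each category

print(missing_counts)
print(age_summary[["min", "max", "mean"]])
print(sex_counts)
```

Unexpected minimum or maximum values, or categories that should not exist, are often the first sign of data-entry problems.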


import matplotlib.pyplot as plt

# Histogram of respondent ages, grouped into 20 bins
excel_imported["age"].hist(bins=20)
plt.show()

Output (histogram of the age column; figure not reproduced here)
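The same distribution can also be tabulated as text. The sketch below takes the ten age values from the Excel output above and counts them in ten-year bins with pd.cut; the bin edges are an arbitrary choice for illustration and differ from the 20 bins used in the histogram call.

```python
import pandas as pd

ages = pd.Series([43, 45, 51, 52, 52, 21, 63, 48, 67, 40], name="age")

# Ten-year bins; each interval is open on the left and closed on the right.
bin_edges = [20, 30, 40, 50, 60, 70]
age_bins = pd.cut(ages, bins=bin_edges)

# sort_index keeps the bins in ascending order rather than sorting by count.
bin_counts = age_bins.value_counts().sort_index()
print(bin_counts)
```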

3. Importing Text Files

A text file is a file, usually with a .txt extension, that contains unformatted text. To read such a file, the following general syntax can be used.


# Read the first 200 characters of a text file
with open("C:/Users/Yosef/OneDrive - cumc.columbia.edu/Desktop/beatles.txt", "r") as f:
    data = f.read(200)
print(data)

Output:


Yesterday, all my troubles seemed so far away
Now it looks as though they're here to stay
Oh, I believe in yesterday Suddenly, I'm not half the man I used to be
There's a shadow hanging over me.
Oh, y
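Reading a fixed number of characters can cut the text off mid-word, as the last line of the output shows. A common alternative is to read the file line by line. The sketch below writes a small throwaway file first so that it is self-contained; the sample text and variable names are made up for the example.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
sample_text = "first line\nsecond line\nthird line\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

# Iterating over the file object yields one line at a time.
with open(tmp_path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(tmp_path)  # clean up the temporary file
print(lines)
```

Iterating this way avoids loading the whole file into memory at once, which matters for large text files.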
