The first step in data analysis is to look at your data before working on it, which is why you should always start with exploratory data analysis.
Exploratory data analysis means taking an insightful look at your data before working with it. It starts with step 1: importing your data.
Not all the data you need to work with comes in CSV format, so Python lets you import data from a variety of other formats.
First, we import pandas under its conventional alias, pd.
import pandas as pd
Next, we write the code that reads the specific data file. There are different types of files:
- Flat files: plain text files that hold a single table of data, with no relationships between tables
Some of the commands are listed below:
CSV file (from a location on your drive): pd.read_csv, given the file's path
CSV file (from a website): pd.read_csv, given the file's URL
Text file (tab-delimited by default):
data = pd.read_table("C:\\Users\\Someplace\\Desktop\\example2.txt")
SAS file:
data = pd.read_sas('meows.sas7bdat')
Stata file:
data = pd.read_stata('films.dta')
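To make the flat-file commands concrete, here is a minimal sketch. The file names and their contents are made up for illustration, and the files are written to a temporary folder first so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample files, created in a temporary folder for this example.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "example.csv")
txt_path = os.path.join(tmp_dir, "example.txt")

with open(csv_path, "w") as f:
    f.write("name,age\nAna,31\nBen,27\n")
with open(txt_path, "w") as f:
    f.write("name\tage\nAna\t31\nBen\t27\n")

# Comma-separated file from a location on your drive.
csv_data = pd.read_csv(csv_path)

# Tab-separated text file; read_table uses a tab as its default separator.
txt_data = pd.read_table(txt_path)

# read_csv also accepts a URL string in place of a local path
# (not run here because it needs a network connection).

print(csv_data)
print(txt_data)
```

Both calls return a DataFrame, so everything that follows works the same way whichever file the data came from.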
- Relational databases: such as SQLite, queried with SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///name.sqlite')
Then run the query using pandas:
df = pd.read_sql_query("SELECT * FROM Orders", engine)
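If you want to try the query step without a database file, the sketch below may help. It swaps the SQLAlchemy engine for a connection from Python's built-in sqlite3 module, which pandas also accepts, and the Orders table and its rows are invented for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database holding a hypothetical Orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, Customer TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, "Ana", 19.5), (2, "Ben", 42.0), (3, "Ana", 7.25)],
)
conn.commit()

# The query result comes back as a DataFrame.
orders = pd.read_sql_query("SELECT * FROM Orders", conn)

# Filtering in SQL keeps only the rows you need before they reach pandas.
ana_orders = pd.read_sql_query(
    "SELECT * FROM Orders WHERE Customer = ?", conn, params=("Ana",)
)

conn.close()
print(orders)
print(ana_orders)
```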
- Pickled files:
import pickle
with open('pickledname.pkl', 'rb') as file:
    pickled_data = pickle.load(file)
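A pickled file has to be written before it can be read, so here is a small round trip as a sketch. The dictionary is a made-up example, and the file goes to a temporary folder.

```python
import os
import pickle
import tempfile

# Hypothetical Python object to store.
original = {"city": "Cairo", "temperatures": [31, 33, 29]}

pkl_path = os.path.join(tempfile.mkdtemp(), "pickledname.pkl")

# Write the object in binary mode ('wb').
with open(pkl_path, "wb") as file:
    pickle.dump(original, file)

# Read it back in binary mode ('rb').
with open(pkl_path, "rb") as file:
    pickled_data = pickle.load(file)

print(pickled_data)
print(type(pickled_data))
```

The object comes back with its original Python type. Only load pickled files from sources you trust, because unpickling can execute arbitrary code.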
- HDF5 files:
import h5py
filename = 'name_file.hdf5'
data = h5py.File(filename, 'r')
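Opening an HDF5 file only gives you a handle, so you still need to find the datasets inside it. The sketch below assumes h5py and NumPy are installed. The group name "measurements" and dataset name "strain" are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

h5_path = os.path.join(tempfile.mkdtemp(), "name_file.hdf5")

# Create a small file with one hypothetical dataset inside a group.
with h5py.File(h5_path, "w") as f:
    f.create_dataset("measurements/strain", data=np.arange(6).reshape(2, 3))

# Open read-only, list the top-level names, and copy the dataset into a NumPy array.
with h5py.File(h5_path, "r") as data:
    top_level = list(data.keys())
    strain = data["measurements/strain"][:]

print(top_level)
print(strain.shape)
```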
- MATLAB files:
import scipy.io
filename = 'name.mat'
mat = scipy.io.loadmat(filename)
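It helps to know what loadmat gives back. In this sketch a made-up variable called "readings" is saved with scipy.io.savemat and then loaded again, assuming SciPy and NumPy are installed.

```python
import os
import tempfile

import numpy as np
import scipy.io

mat_path = os.path.join(tempfile.mkdtemp(), "name.mat")

# Save a hypothetical variable in MATLAB's .mat format.
scipy.io.savemat(mat_path, {"readings": np.array([[1.0, 2.0], [3.0, 4.0]])})

# loadmat returns a dict: keys are variable names, values are NumPy arrays.
mat = scipy.io.loadmat(mat_path)

# Keys wrapped in double underscores are file metadata, not variables.
variables = [key for key in mat if not key.startswith("__")]

print(variables)
print(mat["readings"])
```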
You can also get data out of an HTML page, a technique called web scraping. In Python, web scraping is commonly done with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://www.something.html'  # placeholder; requests needs the full URL, including the scheme
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
pretty_soup = soup.prettify()
print(pretty_soup)
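Printing the whole page is rarely the goal; usually you want specific pieces of it. Here is a minimal sketch that parses a made-up HTML string instead of a downloaded page, so it runs without a network connection. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; in practice this would be the text of the HTTP response.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Pull out the page title and the target of every link.
title = soup.title.string
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
```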
After importing the data comes the step of reading and cleaning it.
We start with a general look at the data using familiar pandas commands such as head(), info() and describe().
This step aims at understanding what your data is about and whether any of it is missing or duplicated.
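As a rough sketch of that first look, the example below uses a tiny made-up DataFrame that contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cleo"],
        "age": [31, 27, 27, np.nan],
    }
)

print(df.head())      # first rows
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for numeric columns

missing_per_column = df.isna().sum()    # missing values in each column
duplicate_rows = df.duplicated().sum()  # rows that repeat an earlier row

print(missing_per_column)
print(duplicate_rows)
```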
There are various ways to deal with missing or duplicate data. You can read about some of them in the blog post Pandas Manipulation Techniques.
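As a small taste, here are three common options sketched on a made-up DataFrame. Filling with the column mean is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one duplicated row and one missing age.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cleo"],
        "age": [31, 27, 27, np.nan],
    }
)

# Remove rows that repeat an earlier row.
deduplicated = df.drop_duplicates()

# Option 1: drop rows that contain a missing value.
dropped = deduplicated.dropna()

# Option 2: fill the missing age with the mean of the known ages.
filled = deduplicated.fillna({"age": deduplicated["age"].mean()})

print(deduplicated)
print(dropped)
print(filled)
```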
Another important step in reading your data is visualizing it. A plot reveals outliers and helps you find inconsistencies that might hinder your analysis.
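A box plot is one simple way to spot outliers. The sketch below assumes matplotlib is installed and uses made-up ages. It saves the plot to a temporary file and then lists the points lying more than 1.5 times the interquartile range beyond the quartiles, which is the cutoff matplotlib's box plot uses by default.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # draw to a file, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages; the last value is an obvious data-entry error.
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 120], name="age")

# Draw the box plot and save it as an image.
fig, ax = plt.subplots()
ax.boxplot(ages.to_numpy())
ax.set_ylabel("age")
plot_path = os.path.join(tempfile.mkdtemp(), "age_boxplot.png")
fig.savefig(plot_path)
plt.close(fig)

# List the same points numerically using the 1.5 * IQR rule.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

print(outliers)
```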
Once you have done all these steps, your data is ready for work!