The power of pandas
In this blog post we will explain why pandas is a powerful library for data scientists.
First, what is pandas?
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
History of pandas
In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open sourced, and it is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible. Thank you to all of our contributors. Since 2015, pandas has been a NumFOCUS sponsored project. This helps ensure the success of the development of pandas as a world-class open-source project.
How to install pandas?
1-Download Anaconda for your operating system and the latest Python version, run the installer, and follow the steps. Please note:
-Installing Anaconda as root or administrator is not needed (and is discouraged).
-When asked if you wish to initialize Anaconda3, answer yes.
-Restart the terminal after completing the installation.
Detailed instructions on how to install Anaconda can be found in the Anaconda documentation.
2-In the Anaconda Prompt (or the terminal on Linux or macOS), start JupyterLab by running jupyter lab.
3-In JupyterLab, create a new (Python 3) notebook.
4-In the first cell of the notebook, import pandas and check the version by printing pd.__version__.
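The snippet for this last step is not shown in the post. A minimal version, assuming pandas is installed in the active environment, might look like this:

```python
import pandas as pd

# Print the installed pandas version to confirm that the import works.
print(pd.__version__)
```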
Now we are ready to use pandas, so let's learn more techniques with it.
Importing the pandas library
First, we import the pandas library:
import pandas as pd
Now we are ready to use pandas in our code.
Reading CSV files
CSV (comma-separated values) files are a straightforward way to store large data collections. pandas makes it easy to read the data in a CSV file.
In this article I use the Titanic data set from Kaggle:
df = pd.read_csv('train.csv')
We can also take a quick look at the first rows of our data by using .head():
df.head()
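If train.csv is not at hand, the same two calls can be tried on a small in-memory sample. This is only a sketch: the three rows below are made up, and the variable names csv_text and sample are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics a few Titanic columns.
csv_text = """PassengerId,Survived,Sex,Age
1,0,male,22
2,1,female,38
3,1,female,
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default).
print(sample.head(2))
print(sample.shape)
```

The empty Age field in the last row is read as a missing value, which is the kind of gap the later sections deal with.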
Getting information
We can get information about our data, such as the column names, the number of non-null values, and the data types, by using .info():
df.info()
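As a rough illustration of what .info() reports, the sketch below builds a small made-up frame (the name people is hypothetical) and captures the summary as text. The buf parameter is part of DataFrame.info; everything else is illustrative.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in the "Age" column.
people = pd.DataFrame({"Name": ["Ali", "Ahmed", "Alaa"], "Age": [21.0, None, 20.0]})

# info() prints to stdout by default; passing buf captures the text instead.
buffer = io.StringIO()
people.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# The same facts are also available programmatically.
print(people.count())   # non-null values per column
print(people.dtypes)    # data type of each column
```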
Working with null values
As we can see, there are some null values in the Age and Cabin columns. pandas makes working with null values easy.
pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To support this convention, there are several useful functions for detecting, removing, and replacing null values in a pandas DataFrame:
isnull()
To check for null values in a pandas DataFrame, we use isnull(). It returns a DataFrame of Boolean values that are True where a value is missing.
notnull()
To check for values that are present, we use notnull(). It returns a DataFrame of Boolean values that are False where a value is missing.
dropna()
To drop null values from a DataFrame, we use dropna(). It drops the rows or columns that contain null values, with options that control which ones are dropped.
fillna()
To replace null values, we use fillna(). It replaces NaN values with a value that we provide.
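In our data we use .fillna() to resolve the Cabin problem. Assigning the result back to the column is more reliable than calling fillna with inplace=True on a selected column, which newer pandas versions warn about:
df["Cabin"] = df["Cabin"].fillna("No Cabin")
After this, the Cabin column no longer contains null values.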
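The four functions above can be tried side by side on a small frame. The data and variable names below are made up for illustration; only the pandas calls themselves come from the library.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: None and np.nan are both treated as missing.
frame = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Cabin": ["C85", None, None],
})

missing_mask = frame.isnull()              # True where a value is missing
present_mask = frame.notnull()             # the exact opposite of isnull()
missing_per_column = frame.isnull().sum()  # number of missing values per column

# Drop every row that has at least one missing value.
complete_rows = frame.dropna()

# Keep only the columns that have at least two non-missing values.
dense_columns = frame.dropna(axis="columns", thresh=2)

# Replace missing cabins, assigning the result back to the column.
filled = frame.copy()
filled["Cabin"] = filled["Cabin"].fillna("No Cabin")

print(missing_per_column)
print(filled)
```

Summing the Boolean mask is a common way to see how many values are missing before deciding whether to drop or fill them.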
Pivot table
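A pivot table in pandas is an excellent tool for summarizing one or more numeric variables based on two other categorical variables. The .pivot() method used here only reshapes the data, giving one row per PassengerId and one Age column per Sex value; .pivot_table() works the same way but also aggregates when several rows share the same index and column values.
df.pivot(index='PassengerId', columns='Sex', values='Age')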
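To summarize a numeric column by two categorical columns, pivot_table with an aggregation function is the usual choice. The sketch below uses made-up passengers; the column names follow the Titanic data set, and the names passengers and mean_age are hypothetical.

```python
import pandas as pd

# Hypothetical passengers; the column names follow the Titanic data set.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Age": [22.0, 38.0, 26.0, 54.0, 42.0, 30.0],
})

# One row per Sex, one column per Pclass, each cell the mean Age of that group.
mean_age = passengers.pivot_table(
    index="Sex", columns="Pclass", values="Age", aggfunc="mean"
)
print(mean_age)
```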
Check duplicates
pandas lets us check for duplicate rows in our data by using .duplicated(), which returns a Boolean Series that is True for every row that repeats an earlier one.
df.duplicated()
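A short sketch of how the result of .duplicated() is typically used, on made-up records (the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical records in which the third row repeats the first.
records = pd.DataFrame({
    "Name": ["Ali", "Ahmed", "Ali", "Alaa"],
    "Age": [21, 19, 21, 20],
})

# True for each row that is an exact repeat of an earlier row.
is_duplicate = records.duplicated()
duplicate_count = int(is_duplicate.sum())

# Remove the repeated rows, keeping the first occurrence of each.
unique_records = records.drop_duplicates()

# Flag every row whose Name occurs more than once, including the first one.
same_name = records.duplicated(subset=["Name"], keep=False)

print(duplicate_count)
print(unique_records)
```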
Quick statistics
The pandas .describe() method shows basic statistical details such as the count, mean, standard deviation, minimum, maximum, and percentiles of a DataFrame or a Series of numeric values.
df.describe()
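The result of .describe() is itself a pandas object, so individual statistics can be read from it by label. A small sketch on made-up ages (the names ages, stats, and tails are hypothetical):

```python
import pandas as pd

# Hypothetical ages, chosen so that the statistics are easy to check by hand.
ages = pd.Series([20, 22, 24, 26, 28], name="Age")

# For a numeric Series, describe() returns a Series indexed by statistic name.
stats = ages.describe()
print(stats)
print(stats["mean"], stats["50%"])

# The percentiles parameter changes which percentiles are reported.
tails = ages.describe(percentiles=[0.1, 0.9])
print(tails)
```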
Visualization with pandas
pandas can also be used to visualize data by using .plot(), which relies on Matplotlib. Here we count the passengers of each sex and draw the counts as a bar chart:
df['Sex'].value_counts().plot(kind='bar')
DataFrames in pandas
What is a DataFrame?
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, an SQL table, or a dict of Series objects.
data = {'Brand': ['HH', 'TT', 'FF', 'AA'],
        'Price': [22000, 25000, 27000, 35000],
        'Year': [2015, 2013, 2018, 2018]}
df = pd.DataFrame(data, columns=['Brand', 'Price', 'Year'])
print(df)
We can sort the data stored in a DataFrame by using .sort_values():
df.sort_values(by=['Brand'], inplace=True)
print(df)
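Sorting is not limited to a single column or to ascending order. The sketch below reuses the car data from above under the hypothetical name cars and sorts by two columns without modifying the original frame.

```python
import pandas as pd

cars = pd.DataFrame({
    "Brand": ["HH", "TT", "FF", "AA"],
    "Price": [22000, 25000, 27000, 35000],
    "Year": [2015, 2013, 2018, 2018],
})

# Newest first; ties on Year are broken by the lowest Price.
# Without inplace=True, sort_values returns a new DataFrame.
by_year_then_price = cars.sort_values(by=["Year", "Price"], ascending=[False, True])

# sort_values keeps the original index labels; reset_index renumbers the rows.
renumbered = by_year_then_price.reset_index(drop=True)

print(by_year_then_price)
print(renumbered)
```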
Iterating over the rows of a DataFrame
We can iterate over the rows of a pandas DataFrame in several ways.
1-Using the index attribute of the DataFrame
data = {'Name': ['Ali', 'Ahmed', 'Alaa'],
        'Age': [21, 19, 20],
        'Stream': ['Math', 'Arts', 'Biology'],
        'Percentage': [88, 92, 95]}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", df)
print("\nIterating over rows using index attribute :\n")
# Iterate through each row and select
# the 'Name' and 'Stream' columns respectively.
for ind in df.index:
    print(df['Name'][ind], df['Stream'][ind])
2-Using the loc[] indexer of the DataFrame
data = {'Name': ['Ali', 'Ahmed', 'Alaa'],
        'Age': [21, 19, 20],
        'Stream': ['Math', 'Arts', 'Biology'],
        'Percentage': [88, 92, 95]}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", df)
print("\nIterating over rows using loc function :\n")
# Iterate through each row and select
# the 'Name' and 'Age' columns respectively.
# loc is label-based; range(len(df)) works here because
# the DataFrame has the default 0..n-1 index.
for i in range(len(df)):
    print(df.loc[i, "Name"], df.loc[i, "Age"])
3-Using the iloc[] indexer of the DataFrame
data = {'Name': ['Ali', 'Ahmed', 'Alaa'],
        'Age': [21, 19, 20],
        'Stream': ['Math', 'Arts', 'Biology'],
        'Percentage': [88, 92, 95]}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", df)
print("\nIterating over rows using iloc function :\n")
# Iterate through each row and select the columns
# at positions 0 and 2 ('Name' and 'Stream') respectively.
for i in range(len(df)):
    print(df.iloc[i, 0], df.iloc[i, 2])
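Besides the three approaches above, pandas also provides the iterrows() and itertuples() methods, and many loops can be avoided entirely with column-wise operations. The sketch below is one possible illustration on the same student data; the variable names are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Ali", "Ahmed", "Alaa"],
    "Age": [21, 19, 20],
    "Stream": ["Math", "Arts", "Biology"],
    "Percentage": [88, 92, 95],
})

# iterrows() yields (index label, row as a Series) pairs.
pairs_from_iterrows = [(row["Name"], row["Stream"]) for _, row in students.iterrows()]

# itertuples() yields lightweight namedtuples and is usually faster.
pairs_from_itertuples = [(row.Name, row.Age) for row in students.itertuples(index=False)]

# Many loops can be replaced by a column-wise (vectorized) expression.
labels = students["Name"] + " (" + students["Stream"] + ")"

print(pairs_from_iterrows)
print(pairs_from_itertuples)
print(labels.tolist())
```

For large DataFrames the vectorized form is generally the one to prefer, since it avoids a Python-level loop over the rows.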
Conclusion
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
I hope this article was useful for you. Thank you for your time.
Resources:
-Kaggle
-pandas documentation
-GeeksforGeeks