top of page
learn_data_science.jpg

Data Scientist Program

 

Free Online Data Science Training for Complete Beginners.
 


No prior coding knowledge required!

Pandas Techniques for Data Manipulation in Python

Introduction

Pandas is an open-source python library that is used for data manipulation and analysis. It provides many functions and methods to speed up the data analysis process. Pandas is built on top of the NumPy package, hence it takes a lot of basic inspiration from it. The two primary data structures are Series which is 1 dimensional and DataFrame which is 2 dimensional.

As pandas is the best Library for data manipulation , today we are going to learn some of techniques that we always use to get our job done.

--------------------------------------------------------------------------------------------------

first we will load our data set.

import pandas as pd
df = pd.read_csv("minerals_price_changes.csv")
df.head()

Output is :



1- groupby()


The “groupby()” function is very useful in data analysis as it allows us to unveil the underlying relationships among different variables. And then we can apply Aggregations as well on the groups with the “agg()” function and pass it with various aggregation operations such as mean, size, sum, std etc.


now let's group by year and take the sum.

df.groupby('Year').agg(sum).head()

Output is :



2- drop column


DataFrame has a method called drop() that removes rows or columns according to specify column(label) names and corresponding axis.


let's see an example if we need to delete Price_silver column:


df.drop("Price_silver", axis = 1, inplace = True)
df.head()

Output is :



3- . loc() and iloc()


loc() and iloc() methods are used in slicing data from the pandas DataFrame which helps in filtering the data according to some given condition.

loc – select by labels

iloc – select by positions


let's see an example if we want to slice a price of alum and gold in 1992.


df.loc[0:10,['Year','Month','Price_alum', 'Price_gold']]

Output is:



now let's see an example if we want to slice a price of alum and gold in 1992 using iloc.

df.iloc[0:10,:4]

Output is :



4- Plotting


We can plot a dataframe using the plot() method. But we need a dataframe to plot. We can create a dataframe by just passing a dictionary to the DataFrame() method of the pandas library.


let's now see an example of The price of gold has changed over the years.

df.plot(kind='line',x='Price_gold',y='Year')

Output is:


5- Pivot Table


Pivot table in pandas is an excellent tool to summarize one or more numeric variable based on two other categorical variables.

Pivot tables in pandas are popularly seen in MS Excel files. In python, Pivot tables of pandas dataframes can be created using the command: pandas.pivot_table.

You can aggregate a numeric column as a cross tabulation against two categorical columns. In this article, you’ll see how to create pivot tables in pandas and understand its parameters with worked out examples.



pandas.pivot_table


Syntax


pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=True, margins_name=’All’, observed=False)


Purpose:


Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table of pandas will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame


Parameters:


data: Dataframe, The dataset whose pivot table is to be made. values: Column, The feature whose statistical summary is to be seen. index: Column, Used for indexing the feature passed in the values argument columns: Column, Used for aggregating the values according to certain features observed bool, (default False): This parameter is only applicable for categorical features. If it is set to ‘True’ then the table will show values only for categorical groups


Returns:


DataFrame, An Excel style pivot table

Use the pd.pivot_table() function and specify what feature should go in the rows and columns using the index and columns parameters respectively. The feature that should be used to fill in the cell values should be specified in the values parameter.


Let’s create a sample dataset.


import numpy as np

df = pd.DataFrame({'First Name': ['Aryan', 'Rohan', 'Riya', 'Yash', 'Siddhant', ],'Last Name': ['Singh', 'Agarwal', 'Shah', 'Bhatia', 'Khanna'],'Type': ['Full-time Employee', 'Intern', 'Full-time Employee','Part-time Employee', 'Full-time Employee'],'Department': ['Administration', 'Technical', 'Administration','Technical', 'Management'],'YoE': [2, 3, 5, 7, 6],'Salary': [20000, 5000, 10000, 10000, 20000]})

df

Use pd.pivot_table and specify the data, index, columns, aggfunc and `values` parameters.


output = pd.pivot_table(data=df,index=['Type'],                                  columns=['Department'],values='Salary',                         aggfunc='mean') 
output

Output is :


Here, we have made a basic pivot table in pandas which shows the average salary of each type of employee for each department. As there are no user-defined parameters passed, the remaining arguments have assumed their default values.

0 comments

Recent Posts

See All

Comments