# Data Manipulation With Pandas Library

**Pandas** is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

__To install the pandas library__

`pip install pandas`

__We need to import ____the ____pandas library before ____using____ it. __

`import pandas as pd`

__To read CSV data, we use read_csv of pandas__

`df = pd.read_csv('iris.csv')`

__To explore data read we can use ____the ____.head or .tail method__

`df.head()`

.head() gives the top five rows of the dataframe. We can input the number of rows to .head() that we would like to read. Similarly, we can read the last five rows of the dataframe using the .tail method. For example, df.head(10) reads the first 10 lines of rows and df.tail(10) reads 10 rows from the last rows.

__To read the unique values of the column__

`df['sepal_length'].unique()`

All the unique values of the sepal_length column are printed.

__Sorting the column__

Values of the column can be sorted in ascending as well as descending column. The line below sorts the value of the sepal_length column in ascending order.

`data.sort_values('sepal_length',ascending=True)`

__Subsetting data__

We can subset the data frame also. The below subset the rows only if the class of the flower is iris-setosa with sepal_length greater than 5.

`data[(data['class']=='Iris-setosa') & (data['sepal_length']>5)]`

__Check value contains in the column or nor using .isin__

The .isin is used to check the value contains in the column or not. To use .isin the column must be categorical data. The code below gives the rows with 'Iris-virginica' value of categorical column class.

`data['class'].isin(['Iris-virginica'])`

__Applying a statistical function to a column value__

There are different types of statistical functions in the pandas library like mean, median, mode, standard deviation, etc. Amond them we have used maximum and sum functions which calculate the maximum value between the value of the sepal_length and the total sum of the values of the sepal_width.

```
data['sepal_length'].max()
data['sepal_width'].sum()
```

__Aggregate____ function__

We can apply a function to a single column or multiple columns using the .agg() function.

`data[['sepal_length','sepal_width','petal_length','petal_width']].agg(['min'])`

In the above line,we have use a minimum function to calculate a minimum value for column name 'sepal_length','sepal_width' ,'petal_length' and 'petal_width' columns.

We can apply a user-defined function to a single column or multiple columns also. From the example below, we have created a function named as test that takes an argument col. The test function calculates the maximum value and minimum value of the column passed to the test function. The test function is applied to all columns defined inside the square bracket of data. That means the test function calculates the maximum and minimum value of all of the columns sepal_length, sepal_width, petal_length, and petal_width.

```
def test(col):
return col.max(),col.min()
data[['sepal_length','sepal_width','petal_length','petal_width']].agg(test)
```

__GroupBy__

Pandas** dataframe.groupby()** function is used to split the data into groups. The dataframe data is grouped with respect to each value of the column class and count value of sepal_length for each class type.

`data.groupby(['class'])['sepal_length'].count()`

__Pivot Table__

The pivot_table() function is used to create a spreadsheet-style pivot table as a DataFrame.A simple example of the pivot_table is added to the snippet of the code. Count a sepal_length value for each individual value of the class column.

**data.pivot_table(values='sepal_length',index='class',aggfunc='count') **

__Slicing the dataframe__

We can select a part of dataframe using the .iloc and .loc methods. .loc is label-based, which means that we have to specify the name of the rows and columns that we need to filter out. On the other hand, iloc is integer index-based. So here, we have to specify rows and columns by their integer index.

`data.iloc[0:10,0:3]`

0:10 reads the first 10 ten rows and 0:3 reads 3 columns from the left side.

The command below reads a row only with value='Iris-setosa' and column name from 'sepal_length' to column name 'petal_length'.

`data.loc['Iris-setosa','sepal_length':'petal_length']`

## Comments