# Pandas Techniques for Data Manipulation

```
# importing pandas as pd
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
```

```
# create a sample DataFrame: names ('A'), degrees ('B'), and ages ('C')
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'C': [27, 23, 21, 23, 24]})
df
```

### 1. Pivot tables

While **pivot()** provides general-purpose pivoting with various data types (strings, numerics, etc.), pandas also provides **pivot_table()** for pivoting with aggregation of numeric data. The **pivot_table()** function can be used to create spreadsheet-style pivot tables.

*It takes a number of arguments*:

- **data**: a DataFrame object.
- **values**: a column or a list of columns to aggregate.
- **index**: a column, Grouper, or array of the same length as the data; keys to group by on the pivot table index.
- **columns**: a column, Grouper, an array of the same length as the data, or a list of them; keys to group by on the pivot table columns. If an array is passed, it is used in the same manner as column values.
- **aggfunc**: function to use for aggregation, defaulting to `numpy.mean`.

```
# create a pivot table: total age ('C') per degree ('B')
table = pd.pivot_table(df, values='C', index=['B'],
                       aggfunc='sum')
table
```

### 2. Iterating over rows of DataFrame

Iterating through **pandas** objects is generally **slow**. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches (listed in order of performance):


1. **Vectorization**: many operations can be performed using built-in methods, NumPy functions, or (boolean) indexing.

2. **List comprehensions**: iterate over the values in plain Python, building a list.

3. **df.apply()**: apply a function along an axis of the DataFrame.

4. **Iteration family**: `df.itertuples()` / `df.items()` / `df.iterrows()`.

Comparing the first three approaches:

```
# vectorization: boolean comparison on the whole column at once
df_grad_vec = df['B'] == 'Graduate'
print(df_grad_vec)

# list comprehension over the column values
df_grad_comp = [x == 'Graduate' for x in df['B']]
print(df_grad_comp)

# apply(): element-wise function on the column
df_grad_apply = df['B'].apply(lambda x: x == 'Graduate')
print(df_grad_apply)
```
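The fourth approach can be sketched with `itertuples()`, which yields each row as a named tuple (the DataFrame is redefined here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'C': [27, 23, 21, 23, 24]})

# iterate using itertuples(): each row is a named tuple with fields A, B, C
df_grad_iter = [row.B == 'Graduate' for row in df.itertuples()]
print(df_grad_iter)  # [False, True, True, False, True]
```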

### 3. df.groupby (split / apply / combine)

`df.groupby(by=None, axis=0, as_index=True, dropna=True)`

*It takes a number of arguments:*

- **by**: mapping, function, label, or list of labels.
- **axis**: {0 or 'index', 1 or 'columns'}, default 0.
- **as_index**: bool, default True; for aggregated output, return an object with the group labels as the index.
- **dropna**: bool, default True; if group keys contain NA values, the NA values together with the row/column are dropped.

*By “group by” we are referring to a process involving one or more of the following steps:*

1. **Splitting** the data into groups based on some criteria.

2. **Applying** a function to each group independently, in one of the following ways:

   - **Aggregation**: compute a summary statistic (or statistics) for each group.
   - **Transformation**: perform some group-specific computations and return a like-indexed object.
   - **Filtration**: discard some groups, according to a group-wise computation that evaluates to True or False.

3. **Combining** the results into a data structure.

```
# aggregation: mean age ('C') for each degree ('B')
df_grad_groupby = df.groupby(by='B', as_index=True)['C'].mean()
df_grad_groupby
```
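The transformation and filtration flavors can be sketched on the same DataFrame (redefined here so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'C': [27, 23, 21, 23, 24]})

# transformation: broadcast each group's mean age back to a like-indexed Series
mean_per_group = df.groupby('B')['C'].transform('mean')
print(mean_per_group)

# filtration: keep only groups with more than two members
big_groups = df.groupby('B').filter(lambda g: len(g) > 2)
print(big_groups)
```

Note that `transform` returns one value per original row, while `filter` returns the surviving rows with their original index.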

### 4. Categorical data

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples are gender, social class, and blood type.

**4.1. Controlling behaviors**

The default behavior is:

- Categories are inferred from the data.
- Categories are unordered.

To control these behaviors we can use any of:

- `df.astype('category')`
- passing `dtype='category'` while creating the DataFrame
- `CategoricalDtype`
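A minimal sketch of all three options (the `sizes` example with its category names is hypothetical):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# option 1: convert an existing Series/column with astype
s = pd.Series(['Masters', 'Graduate', 'Graduate']).astype('category')

# option 2: pass dtype='category' at construction time
s2 = pd.Series(['Masters', 'Graduate'], dtype='category')

# option 3: CategoricalDtype gives full control, e.g. an explicit ordering
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['small', 'large', 'medium'], dtype=size_type)
print(sizes.min())  # 'small' — ordering makes comparisons meaningful
```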

**4.2. Statistics for categorical data (frequencies/proportions)**

- statistics for categorical data using `describe()`
- statistics for categorical data using `value_counts()`
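Both, shown on the degree column of the sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate']})
cat = df['B'].astype('category')

# describe(): count, number of unique categories, top category, and its frequency
summary = cat.describe()
print(summary)

# value_counts(): absolute frequencies; normalize=True gives proportions
freqs = cat.value_counts()
props = cat.value_counts(normalize=True)
print(freqs)
print(props)
```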

### 5. Statistical Functions

**5.1. Covariance**

To compute pairwise covariances among the series in the DataFrame:

`df.cov()`
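The sample DataFrame above has only one numeric column, so here is a minimal sketch on a hypothetical two-column numeric frame:

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                       'y': [2.0, 4.0, 6.0, 8.0]})

# pairwise sample covariances (ddof=1 by default)
print(df_num.cov())
```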

**5.2. Correlation**

To compute pairwise correlations among the series in the DataFrame:

`df.corr()`

```
# pairwise correlations of five random columns
df3 = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
df3.corr()
```