Jihed 503

Nov 23, 2021 · 3 min

Five most important Pandas Techniques for Data Manipulation in Python

Real-world data is messy. That’s why libraries like pandas are so valuable.

Using pandas, we can take the pain out of data manipulation by extracting, filtering, and transforming data in DataFrames, clearing a path for quick and reliable data analysis.

In this article, we walk through some useful pandas techniques for working with data in Python.

  1. Importing data

  2. Retrieving information

  3. Filtering

  4. Apply Function

  5. Plotting

First of all, we have to import pandas.

import pandas as pd

Importing data using pandas

The pandas library offers many functions for loading files in different formats.

csv files:

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data.

titanic_df = pd.read_csv('titanic.csv')
 
titanic_df.info()
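To try read_csv() without a file on disk, here is a minimal sketch that reads a small in-memory CSV. The rows and column names are made up for illustration and are not taken from titanic.csv.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = """PassengerId,Name,Age,Fare
1,Alice,30,10.0
2,Bob,,20.0
3,Carol,40,30.0
"""

# read_csv accepts a path, a URL, or any file-like object.
# index_col turns the named column into the row index.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

# Empty fields are parsed as missing values (NaN).
missing_ages = int(sample_df["Age"].isna().sum())
print(sample_df.shape, missing_ages)
```

Because PassengerId becomes the index, it is no longer counted among the columns reported by shape.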

JSON files:

JSON (JavaScript Object Notation) is a plain-text format that represents data as nested objects and arrays. It is widely used in programming, and pandas can read it directly. In our examples we will be using a JSON file called 'data.json'.

df = pd.read_json('data.json')
 
df.head()
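The layout of 'data.json' is not shown here, so the sketch below assumes the default column-oriented layout that read_json() expects for a DataFrame: each key is a column name, and each value maps a row label to a cell. The values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical JSON text in the layout {column -> {row label -> value}}.
json_text = """
{
  "COUNTRY": {"AAA": "Aland", "BBB": "Bland"},
  "POP": {"AAA": 100.5, "BBB": 200.5}
}
"""

# Like read_csv, read_json accepts a path, a URL, or a file-like object.
json_df = pd.read_json(io.StringIO(json_text))
print(json_df)
```

If a file stores a list of records instead, the orient parameter of read_json() (for example orient='records') tells pandas how to interpret it.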

HTML files:

An HTML file is a plain-text file that uses HyperText Markup Language to help browsers render web pages. The extensions for HTML files are .html and .htm. pandas.read_html() returns a list of DataFrames, one for each table found in the page.

tables = pd.read_html('https://en.wikipedia.org/wiki/Minnesota') # list of DataFrames, one per table
 
tables[6].tail() # displays the last five rows of the seventh table (index 6)

Retrieving information from a DataFrame:

To better understand our dataset, we can use a few pandas attributes and methods that describe it.

(rows, columns)

df.shape

(20, 6)

Describe index

df.index

Index(['CHN', 'IND', 'USA', 'IDN', 'BRA', 'PAK', 'NGA', 'BGD', 'RUS', 'MEX',
       'JPN', 'DEU', 'FRA', 'GBR', 'ITA', 'ARG', 'DZA', 'CAN', 'AUS', 'KAZ'],
      dtype='object')

Summary statistics

df.describe()

Median of values

df.median(numeric_only=True) # skip text columns such as COUNTRY

POP      126.400
AREA    2173.060
GDP     1588.935
dtype: float64
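The inspection calls above can be tried together on a tiny frame. This is a minimal sketch with made-up values; only the column names echo the ones used in this article.

```python
import pandas as pd

# Hypothetical three-row frame; the numbers are illustrative only.
demo_df = pd.DataFrame(
    {
        "COUNTRY": ["Aland", "Bland", "Cland"],
        "POP": [1400.0, 1350.0, 330.0],
        "AREA": [9600.0, 3300.0, 9800.0],
    },
    index=["AAA", "BBB", "CCC"],
)

print(demo_df.shape)        # (number of rows, number of columns)
print(list(demo_df.index))  # row labels
print(demo_df.describe())   # count, mean, std, min, quartiles, max per numeric column

# numeric_only=True restricts the median to the numeric columns.
medians = demo_df.median(numeric_only=True)
print(medians)
```

Note that shape and index are attributes, so they are written without parentheses, while describe() and median() are methods.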

Filtering Data:

Selecting columns by data type

We can use the pandas.DataFrame.select_dtypes(include=None, exclude=None) method to select columns based on their data types. The method accepts either a list or a single data type in the parameters include and exclude. It is important to keep in mind that at least one of these parameters (include or exclude) must be supplied and they must not contain overlapping elements.


 
In this example, we want to select the numeric columns (both integers and floats) of the dataframe by passing in the string 'number' to the include parameter.

numeric_df = df.select_dtypes(include='number')
 

 
numeric_df.head()
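As a small sketch of how include and exclude behave, the frame below mixes text, float, and integer columns. The column names and values are hypothetical.

```python
import pandas as pd

mixed_df = pd.DataFrame(
    {
        "COUNTRY": ["Aland", "Bland"],  # text
        "POP": [1400.0, 1350.0],        # float64
        "RANK": [1, 2],                 # int64
    }
)

# 'number' matches every numeric dtype, integers and floats alike.
numeric_cols = mixed_df.select_dtypes(include="number")

# exclude keeps everything except the listed dtypes.
non_numeric_cols = mixed_df.select_dtypes(exclude="number")

# A list narrows the selection to specific dtypes.
float_only = mixed_df.select_dtypes(include=["float64"])

print(list(numeric_cols.columns))
print(list(non_numeric_cols.columns))
print(list(float_only.columns))
```

The result is always a DataFrame, even when only one column matches.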

Selecting disjointed rows and columns

To select multiple rows and columns, we pass two lists of labels to the .loc indexer: the first for rows, the second for columns. The code below extracts the country name, the population, and the GDP of the countries with index labels CHN and IND.

df.loc[['CHN', 'IND'], ['COUNTRY', 'POP', 'GDP']]
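The same pattern works for labels that are not next to each other. Here is a minimal sketch on a hypothetical frame with made-up values.

```python
import pandas as pd

countries = pd.DataFrame(
    {
        "COUNTRY": ["Aland", "Bland", "Cland"],
        "POP": [1400.0, 1350.0, 330.0],
        "GDP": [12000.0, 2600.0, 19000.0],
        "CONT": ["Asia", "Asia", "Europe"],
    },
    index=["AAA", "BBB", "CCC"],
)

# First list: row labels. Second list: column labels.
subset = countries.loc[["AAA", "CCC"], ["COUNTRY", "GDP"]]
print(subset)

# A single label for each indexer returns one value instead of a DataFrame.
bland_pop = countries.loc["BBB", "POP"]
print(bland_pop)
```

Rows and columns come back in the order given in the lists, not in the order they appear in the original frame.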

Apply function:

The pandas .apply() method takes a function as input and applies it along an axis of the DataFrame: to each column by default, or to each row when axis=1.

Calculating the number of inhabitants per square kilometer

First, we call the .apply() method on our DataFrame and pass it a lambda function. We specify axis=1 so that .apply() calls the function once per row instead of once per column. For every row, the lambda takes the 'POP' value, scales it by 1000, and divides it by the 'AREA' value.

df.apply(
    lambda row: row['POP']*1000/row['AREA'],
    axis=1)
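For simple arithmetic like this, plain column operations give the same result as a row-wise .apply() and are usually faster. The sketch below compares the two on made-up values; it assumes POP is in millions of people and AREA in thousands of square kilometers, which is what the factor of 1000 suggests but is not stated in the data.

```python
import pandas as pd

# Hypothetical values. Assumed units: POP in millions of people,
# AREA in thousands of square kilometers.
density_df = pd.DataFrame(
    {"POP": [1400.0, 330.0], "AREA": [9600.0, 9800.0]},
    index=["AAA", "BBB"],
)

# Row-wise: the lambda is called once per row.
row_wise = density_df.apply(lambda row: row["POP"] * 1000 / row["AREA"], axis=1)

# Vectorized: the same arithmetic on whole columns at once.
vectorized = density_df["POP"] * 1000 / density_df["AREA"]

print(row_wise)
print(vectorized)
```

.apply() with axis=1 remains useful when the per-row logic cannot be written as column arithmetic, for example when it needs branching on several columns.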

Visualizing our data

We want to visualize how China's population has grown over the years. First, we load the data from Wikipedia with read_html(), as we did at the beginning. We set the first column as the index by passing the index_col parameter with a value of 0.

china_df = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_China', index_col=0)[5]
 

 
china_df.head()

Now that we have all the data we need, we are ready to plot our DataFrame.

china_df.plot(kind='line', y='Midyear population', title='China population')
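The table layout on Wikipedia can change over time, so here is a self-contained sketch of the same plotting call on a small frame with made-up numbers. It assumes Matplotlib is installed, since DataFrame.plot() uses it to draw; the axis labels are additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Hypothetical figures, for illustration only.
pop_df = pd.DataFrame(
    {"Midyear population": [1.0, 1.1, 1.2, 1.3]},
    index=[1990, 2000, 2010, 2020],
)

# plot() returns a Matplotlib Axes object that can be customized further.
ax = pop_df.plot(kind="line", y="Midyear population", title="Sample population")
ax.set_xlabel("Year")
ax.set_ylabel("Population (made-up units)")
```

The index is used for the x-axis, which is why setting index_col when loading the data matters. To write the chart to disk, ax.figure.savefig() accepts a file name.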

Conclusion

pandas is a powerful Python library for data science, but it is not the only one. We still need other libraries such as Matplotlib and seaborn.
