Five most important Pandas Techniques for Data Manipulation in Python


Real-world data is messy. That’s why libraries like pandas are so valuable.

Using pandas, we can take the pain out of data manipulation by extracting, filtering, and transforming data in DataFrames, clearing a path for quick and reliable data analysis.

In this article, we walk through five useful pandas techniques for working with data in Python:

  1. Importing data

  2. Retrieving information

  3. Filtering

  4. Apply Function

  5. Plotting


First of all, we have to import pandas.

import pandas as pd

Importing data using pandas

The pandas library offers many functions for loading files in different formats.


csv files:

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data.

titanic_df = pd.read_csv('titanic.csv')
titanic_df.info()
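
As a minimal sketch of what read_csv does, the snippet below parses a small CSV held in memory instead of a file on disk. The column names and values are made up for illustration and are not the real Titanic data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Age,Survived
Alice,29,1
Bob,41,0
Carol,35,1
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # inferred type of each column
```

Passing a path such as 'titanic.csv' works the same way; the in-memory buffer only keeps the example self-contained.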

JSON files:

JSON (JavaScript Object Notation) is a plain-text format that represents data as nested objects and arrays. It is widely used in programming, and pandas can read it directly. In our examples we will be using a JSON file called 'data.json'.

df = pd.read_json('data.json')
df.head()
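
Since the contents of 'data.json' are not shown here, the following sketch uses a made-up JSON string in the "records" layout (a list of row objects) to illustrate how read_json turns JSON into a DataFrame. The column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical JSON: a list of objects, one per row.
json_text = '[{"COUNTRY": "Freedonia", "POP": 10.5}, {"COUNTRY": "Sylvania", "POP": 20.25}]'

# orient="records" tells pandas that each list element is one row.
json_df = pd.read_json(io.StringIO(json_text), orient="records")

print(json_df.head())
```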

HTML files:

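An HTML file is a plain-text file that uses HyperText Markup Language to tell browsers how to render web pages. The extensions for HTML files are .html and .htm. The read_html function returns every table it finds on a page as a list of DataFrames, and it needs an HTML parser such as lxml to be installed.

df = pd.read_html('https://en.wikipedia.org/wiki/Minnesota') # returns a list of DataFrames, one per table

df[6].tail() # displays the last five rows of the table at index 6 (the seventh table)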

Retrieving information from a DataFrame:


To better understand our dataset, we can inspect it with a few pandas attributes and methods that describe the data.


Shape (rows, columns)


df.shape
(20, 6)

Describe index


df.index
Index(['CHN', 'IND', 'USA', 'IDN', 'BRA', 'PAK', 'NGA', 'BGD', 'RUS', 'MEX',
       'JPN', 'DEU', 'FRA', 'GBR', 'ITA', 'ARG', 'DZA', 'CAN', 'AUS', 'KAZ'],
      dtype='object')

Summary statistics


df.describe()

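Median of numeric columns


df.median(numeric_only=True) # skips text columns; recent pandas versions raise an error on them otherwise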
POP      126.400
AREA    2173.060
GDP     1588.935
dtype: float64
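
The inspection calls above can be tried on a small stand-in table. This is a sketch only: the country names, codes, and numbers are invented, and the real dataset has more rows and columns.

```python
import pandas as pd

# Hypothetical data, indexed by made-up three-letter codes.
toy_df = pd.DataFrame(
    {
        "COUNTRY": ["Freedonia", "Sylvania", "Borduria"],
        "POP": [10.0, 20.0, 60.0],
        "AREA": [100.0, 400.0, 300.0],
    },
    index=["AAA", "BBB", "CCC"],
)

print(toy_df.shape)        # number of rows and columns
print(list(toy_df.index))  # row labels
print(toy_df.describe())   # count, mean, std, min, quartiles, max

# numeric_only=True leaves the text column COUNTRY out of the calculation.
medians = toy_df.median(numeric_only=True)
print(medians)
```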

Filtering Data:


Selecting columns by data type


We can use the pandas.DataFrame.select_dtypes(include=None, exclude=None) method to select columns based on their data types. The method accepts either a list or a single data type in the parameters include and exclude. It is important to keep in mind that at least one of these parameters (include or exclude) must be supplied and they must not contain overlapping elements.

In this example, we want to select the numeric columns (both integers and floats) of the dataframe by passing in the string 'number' to the include parameter.


numeric_df = df.select_dtypes(include='number')

numeric_df.head()
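
To see include and exclude side by side, here is a small sketch on an invented table with one text column and two numeric columns. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table mixing text, float, and integer columns.
mixed_df = pd.DataFrame(
    {
        "COUNTRY": ["Freedonia", "Sylvania"],
        "POP": [10.5, 20.25],
        "RANK": [1, 2],
    }
)

# Keep only numeric columns (integers and floats).
numeric_cols = mixed_df.select_dtypes(include="number")

# Keep everything except numeric columns.
non_numeric_cols = mixed_df.select_dtypes(exclude="number")

print(list(numeric_cols.columns))
print(list(non_numeric_cols.columns))
```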

Selecting disjointed rows and columns


To select multiple rows and columns that are not next to each other, we pass two lists of labels to .loc: the first lists the rows, the second lists the columns. The code below shows how to extract the country, the population, and the GDP of the countries with index labels CHN and IND.


df.loc[['CHN', 'IND'], ['COUNTRY', 'POP', 'GDP']]
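
The same pattern can be sketched on a tiny invented table. The labels and values below are hypothetical; only the .loc call mirrors the article.

```python
import pandas as pd

# Hypothetical data with made-up index labels.
toy_df = pd.DataFrame(
    {
        "COUNTRY": ["Freedonia", "Sylvania", "Borduria"],
        "POP": [10.0, 20.0, 60.0],
        "AREA": [100.0, 400.0, 300.0],
        "GDP": [5.0, 7.5, 9.0],
    },
    index=["AAA", "BBB", "CCC"],
)

# First list: row labels. Second list: column labels.
# Neither the rows nor the columns have to be adjacent.
subset = toy_df.loc[["AAA", "CCC"], ["COUNTRY", "GDP"]]

print(subset)
```

The result keeps the order given in the lists, so the selection can also be used to reorder rows or columns.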

Apply function:


The pandas .apply() method takes a function as input and calls it along one axis of a DataFrame: on each column by default, or on each row when axis=1.


Calculating the number of inhabitants per square kilometer


First, we call the .apply() method on our DataFrame and pass it a lambda function that receives one row at a time. For every row, we take the 'POP' value and divide it by the 'AREA' value. The factor of 1000 converts the units, assuming POP is recorded in millions of people and AREA in thousands of square kilometers. Finally, we specify axis=1 to tell .apply() to call the function on each row instead of each column.


df.apply(
    lambda row: row['POP']*1000/row['AREA'],
    axis=1)
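
As a rough illustration, the sketch below runs the same row-wise calculation on invented numbers and compares it with plain column arithmetic, which gives the same result here and is usually faster on large tables. The data and variable names are assumptions.

```python
import pandas as pd

# Hypothetical values: POP in millions, AREA in thousands of square kilometers.
toy_df = pd.DataFrame(
    {"POP": [10.0, 20.0, 60.0], "AREA": [100.0, 400.0, 300.0]},
    index=["AAA", "BBB", "CCC"],
)

# Row-wise version: the lambda is called once per row.
density_apply = toy_df.apply(
    lambda row: row["POP"] * 1000 / row["AREA"],
    axis=1,
)

# Column-wise version: the arithmetic runs on whole columns at once.
density_vec = toy_df["POP"] * 1000 / toy_df["AREA"]

print(density_apply)
print(density_vec)
```

.apply() remains useful when the per-row logic cannot be written as simple column arithmetic.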

Visualizing our data


We want to visualize how China's population has grown over the past years. First, we load the data from Wikipedia with read_html, as we saw at the beginning. We set the first column as the index by passing index_col=0. Note that the table position (5 here) depends on the current layout of the page and may change when the page is edited.


china_df = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_China', index_col=0)[5]

china_df.head()

Now that we have all the data we need, we are ready to plot our DataFrame.


china_df.plot(kind='line', y='Midyear population', title='China population')

Conclusion

Pandas is a powerful Python library for data science, but it is not the only one. We still rely on other libraries such as Matplotlib and seaborn, especially for visualization.

