Python is fast becoming the preferred language for data science. It combines the broad ecosystem of a general-purpose programming language with mature scientific computing libraries. Pandas is a popular Python data analysis library. It provides easy-to-use, efficient data structures for working with numeric and labeled data in tabular form.
In this blog, we will walk through some of the most important data manipulation techniques in pandas. For this purpose, we are going to use the Titanic dataset, which is available on Kaggle. The techniques discussed are:
Reading a CSV File
Dropping columns in the data
Dropping rows in the data
Select columns with specific data types
Replacing values in a DataFrame
1. Reading a CSV file
The CSV (Comma Separated Values) format is quite popular for storing data. A large number of datasets are distributed as CSV files, which can be opened directly in software like Excel or loaded with programming languages like Python.
import pandas as pd

# read the csv data using pd.read_csv function
data = pd.read_csv('test.csv')
data.head()
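If the Kaggle file is not at hand, the same call can be tried on a small in-memory sample. The sketch below is only an illustration: the rows are made up and merely borrow a few Titanic column names, and io.StringIO stands in for the file on disk.

```python
import io

import pandas as pd

# Made-up rows that only mimic some Titanic column names; not real Kaggle data.
csv_text = """PassengerId,Name,Sex,Age,SibSp,Parch,Fare
1,Passenger A,male,30.0,0,0,7.5
2,Passenger B,female,25.0,1,0,12.0
3,Passenger C,female,,0,2,8.25
"""

# read_csv accepts a path or any file-like object, so StringIO replaces test.csv here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 7): three rows, seven columns
print(sample.dtypes)  # the empty Age field is read as NaN, so Age is a float column
```

Checking shape and dtypes right after loading is a quick way to confirm that the file was parsed as expected before manipulating it.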
DataFrame provides a member function, drop(). It accepts a single label or a list of labels and deletes the corresponding rows or columns, depending on the value of the axis parameter (0 for rows, 1 for columns). Its signature is:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
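Two defaults in this signature are worth seeing in action: axis=0 and inplace=False. The following is a minimal sketch on a hypothetical three-row frame (the column names a and b are placeholders, not Titanic columns).

```python
import pandas as pd

# Hypothetical toy frame used only to illustrate the defaults.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# With the default inplace=False, drop returns a new DataFrame and leaves df unchanged.
without_b = df.drop("b", axis=1)

# axis=0 treats the label as a row index label rather than a column name.
without_row_0 = df.drop(0, axis=0)

print(list(df.columns))           # ['a', 'b']
print(list(without_b.columns))    # ['a']
print(list(without_row_0.index))  # [1, 2]
```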
2. Dropping columns in the data
df_dropped = data.drop('Parch', axis=1)
df_dropped.head()
The ‘Parch’ column is dropped from the data. The axis=1 denotes that ‘Parch’ is a column, so pandas looks for ‘Parch’ among the column labels.
We can drop multiple columns at the same time using the following code:
# Drop multiple columns
df_dropped_multiple = data.drop(['SibSp', 'Name'], axis=1)
df_dropped_multiple.head()
The columns ‘SibSp’ and ‘Name’ are dropped from the data.
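The columns and errors parameters from the drop signature offer an alternative spelling. The sketch below uses a made-up two-row frame with Titanic-like column names; ‘Cabin’ is deliberately absent from it to show what errors='ignore' does.

```python
import pandas as pd

# Made-up values; only the column names echo the Titanic data.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "SibSp": [0, 1],
    "Parch": [0, 2],
    "Fare": [7.5, 12.0],
})

# columns=[...] is equivalent to passing the labels with axis=1.
by_keyword = df.drop(columns=["SibSp", "Name"])
by_axis = df.drop(["SibSp", "Name"], axis=1)

# errors='ignore' skips labels that do not exist instead of raising a KeyError.
safe = df.drop(columns=["Cabin"], errors="ignore")

print(list(by_keyword.columns))  # ['Parch', 'Fare']
print(by_keyword.equals(by_axis))
print(safe.equals(df))
```

The keyword form tends to read more clearly because the reader does not have to remember which axis number means columns.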
3. Dropping rows in the data
df_row_dropped = data.drop(2, axis=0)
df_row_dropped.head()
The row with index 2 is dropped from the data. The axis=0 denotes that 2 is a row label, so pandas looks for 2 in the row index.
We can drop multiple rows at the same time using the following code:
# Drop multiple rows
df_row_dropped_multiple = data.drop([1, 4], axis=0)
df_row_dropped_multiple.head()
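Note that drop works on index labels, not positions, so the remaining rows keep their original labels. The sketch below, on a made-up five-row frame, shows the resulting gaps, one way to renumber the rows, and how rows matching a condition could be dropped by passing their labels.

```python
import pandas as pd

# Made-up ages and fares, for illustration only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86],
})

# Dropping labels 1 and 4 leaves labels 0, 2 and 3 in place.
dropped = df.drop(index=[1, 4])
print(list(dropped.index))  # [0, 2, 3]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]

# To drop rows by condition, collect the matching labels first and pass them to drop.
over_30 = df.index[df["Age"] > 30]
young = df.drop(index=over_30)
print(list(young.index))  # [0, 2]
```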
4. Select columns with specific data types
The pandas select_dtypes function lets us specify a data type and select the columns that match it.
# for integer data type
integer_data = data.select_dtypes('int')
integer_data.head()
# for float data type
float_data = data.select_dtypes('float')
float_data.head()
The above code selects all columns with integer and float data types, respectively.
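select_dtypes also takes include and exclude arguments, and the generic 'number' selector covers integer and float columns in one call. A minimal sketch on a made-up frame with Titanic-like column names:

```python
import pandas as pd

# Made-up values; only the column names echo the Titanic data.
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Passenger A", "Passenger B"],
    "Age": [30.0, 25.0],
    "Fare": [7.5, 12.0],
})

# 'number' matches both integer and float columns.
numeric = df.select_dtypes(include=["number"])

# exclude returns everything that does not match the given types.
non_numeric = df.select_dtypes(exclude=["number"])

print(list(numeric.columns))      # ['PassengerId', 'Age', 'Fare']
print(list(non_numeric.columns))  # ['Name']
```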
5. Replacing values in a DataFrame
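The replace function substitutes one set of values for another. It returns a new Series rather than modifying the data, so we assign the result back to the column:
data['Sex'] = data['Sex'].replace(['male', 'female'], ['M', 'F'])
The above code replaces ‘male’ with ‘M’ and ‘female’ with ‘F’. replace also accepts inplace=True, but when it is called on a column selected as data['Sex'], recent pandas versions may not propagate the change back to the DataFrame, so re-assigning the result is the safer choice.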
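replace also accepts a dictionary, which keeps each old value next to its new one. The sketch below uses a made-up three-row column and assigns the result back so that the change is kept.

```python
import pandas as pd

# Made-up column for illustration.
df = pd.DataFrame({"Sex": ["male", "female", "female"]})

# A dict maps each old value to its replacement; replace returns a new Series.
df["Sex"] = df["Sex"].replace({"male": "M", "female": "F"})

print(df["Sex"].tolist())  # ['M', 'F', 'F']
```

Values that do not appear in the mapping are left unchanged.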
Thank you for your time.