The techniques discussed are:
- Imputing missing values
- Boolean Indexing
- Apply Function
- Groupby and Plotting
The dataset used for illustration is about campus recruitment and is taken from the Kaggle page on Campus Recruitment.
The dataset consists of placement data for students at XYZ campus. It includes secondary and higher secondary school percentages and specializations. It also includes degree specialization, degree type, work experience, and the salary offered to placed students.
Import libraries and read data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("Campus Recruitment.csv")
df.head()
Imputing missing values
Why do you need to fill in missing data? Many machine learning models will raise an error if you pass NaN values to them. The easiest fix is to fill the gaps with 0, but this can significantly reduce your model's accuracy.
Missing values are usually represented as NaN, null, or None in the dataset.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   sl_no           215 non-null    int64
 1   gender          215 non-null    object
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object
 6   hsc_s           215 non-null    object
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object
 9   workex          215 non-null    object
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB
sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary            67
dtype: int64
Note the 67 NaN values in the salary column; every other column is complete.
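The two summaries above look like the output of df.info() and a per-column null count. Here is a minimal sketch of the null count on a small made-up frame; the name toy and its values are illustrative assumptions, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the recruitment data (values are made up).
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

# Count missing values per column; isna() is an alias of isnull().
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of rows with a missing salary (mean of a boolean mask).
missing_share = toy["salary"].isna().mean()
print(missing_share)
```

Checking the share as well as the count helps when deciding whether dropping rows is affordable.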
There are several ways to deal with missing data. The methods I will be discussing are:
1. Deleting the columns with missing data: df.dropna(axis=1)
2. Deleting the rows with missing data: df.dropna(axis=0)
3. Filling the missing data with the mean, median, or mode of the other salary values, or with a constant: df['salary'].fillna(0.0)
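The three options can be compared side by side on a tiny frame. This is only a sketch: the frame toy and its numbers are made up, and the median is used as one possible fill statistic.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one complete column, one column with gaps.
toy = pd.DataFrame({
    "mba_p": [58.8, 66.3, 57.8, 59.4],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

# 1. Drop every column that contains at least one NaN.
without_columns = toy.dropna(axis=1)

# 2. Drop every row that contains at least one NaN.
without_rows = toy.dropna(axis=0)

# 3. Fill NaN with a statistic of the observed values (median here).
median_salary = toy["salary"].median()  # NaN is skipped by default
filled = toy["salary"].fillna(median_salary)

print(without_columns.shape, without_rows.shape)
print(filled.tolist())
```

None of these calls modifies toy in place; each returns a new object, which is why the result has to be assigned back when the change should persist.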
df['salary'] = df['salary'].fillna(0.0)
df.head()
We can index by sl_no, since it is a serial number, and additionally by status, since that is the column I care about.
df = df.set_index(['sl_no','status'])
df.head()
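Once status is part of the index, rows can be selected by index level instead of by column. A small sketch, assuming a made-up frame toy with the same three column names:

```python
import pandas as pd

# Toy frame (made-up values) with the same index layout as above.
toy = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, 0.0, 250000.0, 0.0],
})
toy = toy.set_index(["sl_no", "status"])

# Select all rows whose 'status' index level equals "Placed".
# xs() drops the selected level, so the result is indexed by sl_no only.
placed = toy.xs("Placed", level="status")

# Look up a single value by its full (sl_no, status) key.
first_salary = toy.loc[(1, "Placed"), "salary"]

print(placed)
print(first_salary)
```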
df['salary'].apply([np.min, np.max, np.mean])
amin         0.000000
amax    940000.000000
mean    198702.325581
Name: salary, dtype: float64
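Because the missing salaries were filled with 0.0, the minimum above is 0 and the mean is pulled down by the zero rows. Boolean indexing can restrict the statistics to rows with a real offer. This sketch uses a made-up Series, and it passes string names to agg instead of the NumPy functions; that is my substitution, and it labels the result min/max/mean rather than amin/amax.

```python
import pandas as pd

# Toy salaries (made up); 0.0 marks a value that was imputed.
salary = pd.Series([270000.0, 0.0, 250000.0, 0.0], name="salary")

# Statistics over all rows, including the zero-filled ones.
all_stats = salary.agg(["min", "max", "mean"])

# Boolean mask: True where a real salary offer exists.
has_offer = salary > 0
offer_stats = salary[has_offer].agg(["min", "max", "mean"])

print(all_stats)
print(offer_stats)
```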
Groupby and Plotting
High_Placed = df.groupby(['gender','status']).size()
print(High_Placed)
High_Placed.plot(kind='bar')
gender  status
F       Not Placed     28
        Placed         48
M       Not Placed     39
        Placed        100
dtype: int64
More men than women were placed, and more men than women were not placed; the dataset simply contains more men (139) than women (76).
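Raw counts are hard to compare when the groups differ in size, so a placement rate per gender may be more informative. The sketch below rebuilds the counts from the groupby output above by hand; the names counts, table, and placement_rate are my own.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    [28, 48, 39, 100],
    index=pd.MultiIndex.from_tuples(
        [("F", "Not Placed"), ("F", "Placed"),
         ("M", "Not Placed"), ("M", "Placed")],
        names=["gender", "status"],
    ),
)

# One row per gender, one column per status.
table = counts.unstack("status")

# Fraction of each gender that was placed.
placement_rate = table["Placed"] / table.sum(axis=1)
print(placement_rate.round(3))
```

On the real data, the same table could come from calling unstack on the groupby result, and table.plot(kind='bar') would draw the two statuses side by side for each gender.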
Commerce    113
Science      91
Arts         11
Name: hsc_s, dtype: int64
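The counts above have the shape of a value_counts() result on the hsc_s column; the original call is not shown, so that is an assumption. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy higher-secondary streams (made-up sample, not the real column).
hsc_s = pd.Series(
    ["Commerce", "Science", "Commerce", "Arts", "Commerce", "Science"],
    name="hsc_s",
)

# Absolute counts, sorted from most to least frequent.
stream_counts = hsc_s.value_counts()

# Relative frequencies instead of counts.
stream_share = hsc_s.value_counts(normalize=True)

print(stream_counts)
print(stream_share)
```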