Machine Learning: Feature Creation & Extraction in Python

Abu Bin Fahd
Aug 8, 2022

What is feature engineering?

Feature engineering is the process of transforming raw data into features that can be used to build a predictive model, whether through machine learning (including deep learning) or statistical modeling.

Different types of data

Continuous
Categorical
Ordinal
Boolean
Datetime
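
As a rough illustration, the sketch below shows how each of these types might be represented in pandas. The DataFrame and its column names are invented for this example and are not part of the survey dataset used later.

```python
import pandas as pd

# Hypothetical toy data: one column per data type listed above
toy_df = pd.DataFrame({
    # Continuous: real-valued measurements
    "salary": [52000.5, 61000.0, 47500.25],
    # Categorical: unordered labels
    "country": ["France", "India", "France"],
    # Ordinal: labels with a meaningful order
    "education": pd.Categorical(
        ["Bachelor", "Master", "Bachelor"],
        categories=["Bachelor", "Master", "PhD"],
        ordered=True,
    ),
    # Boolean: True/False flags
    "hobbyist": [True, False, True],
    # Datetime: timestamps parsed from strings
    "joined": pd.to_datetime(["2020-01-15", "2021-06-01", "2019-11-30"]),
})

# Inspect how pandas stores each column
print(toy_df.dtypes)
```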

Get to know your data

# Import pandas
import pandas as pd
# Import Combined_DS_v10.csv into so_survey_df
so_survey_df = pd.read_csv("/content/Combined_DS_v10.csv")
# Print the first five rows of the DataFrame
print(so_survey_df.head())
# Print the data type of each column
print(so_survey_df.dtypes)

Selecting specific data types

Often a dataset will contain columns with several different data types (like the one you are working with). Most machine learning models require a consistent data type across features, and most feature engineering techniques apply to only one type of data at a time. For these reasons, you will often want to access just the columns of a certain type when working with a DataFrame.

# Create a subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)
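
The same idea can be tried on a small, self-contained DataFrame. This is a minimal sketch: the toy data is made up, and it uses the generic `'number'` selector together with the `exclude` argument, which is one possible variation on the snippet above rather than the article's own approach.

```python
import pandas as pd

# Hypothetical toy data with mixed column types
toy_df = pd.DataFrame({
    "Age": [25, 32, 41],
    "Salary": [50000.0, 72000.5, 61000.0],
    "Country": ["France", "India", "USA"],
    "Hobbyist": [True, False, True],
})

# 'number' matches every integer and float width (but not booleans)
numeric_df = toy_df.select_dtypes(include="number")

# exclude= returns the complementary set of columns
non_numeric_df = toy_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['Age', 'Salary']
print(list(non_numeric_df.columns))  # ['Country', 'Hobbyist']
```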

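Dealing with categorical variables

One-hot encoding and dummy variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. One-hot encoding creates one binary column per category, while dummy encoding creates one fewer, because the dropped category is implied when all the other columns are 0.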

# Convert the Country column to a one-hot encoded DataFrame
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')
# Print the column names
print(one_hot_encoded.columns)

# Create dummy variables for the Country column
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')
# Print the column names
print(dummy.columns)
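
To make the difference between the two encodings concrete, here is a small sketch on invented data (the country values are placeholders, not taken from the survey). With `drop_first=True`, the first category in sorted order is dropped.

```python
import pandas as pd

# Hypothetical toy data with three distinct categories
toy_df = pd.DataFrame({"Country": ["France", "India", "USA", "India"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="OH")

# Dummy encoding: the first category (France) is dropped and is
# implied whenever all remaining columns are 0
dummy = pd.get_dummies(toy_df, columns=["Country"], drop_first=True, prefix="DM")

print(list(one_hot.columns))  # ['OH_France', 'OH_India', 'OH_USA']
print(list(dummy.columns))    # ['DM_India', 'DM_USA']
```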

Dealing with uncommon categories

# Create a series out of the Country column (copy it so the DataFrame is not modified)
countries = so_survey_df['Country'].copy()
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur fewer than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Label all the rare categories as Other
countries[mask] = 'Other'
# Print the updated category counts
print(countries.value_counts())
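
The grouping step can be checked on a tiny example. This sketch uses made-up values and a hypothetical threshold of 2 (instead of the 10 used above) so the effect is visible on a handful of rows. It also uses `Series.where`, which returns a new Series and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data: two common countries and two rare ones
countries = pd.Series(["USA", "USA", "USA", "India", "India", "Fiji", "Malta"])

# Count how often each category occurs
counts = countries.value_counts()

# Hypothetical cut-off: categories seen fewer than 2 times are "rare"
threshold = 2
rare = counts[counts < threshold].index

# Keep values that are not rare; replace the rest with 'Other'
grouped = countries.where(~countries.isin(rare), "Other")

print(grouped.value_counts())
```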

Numeric: Binarizing columns

While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation is useful. For example, you may only care whether a value is above zero, not what the value is.

# Replace missing ConvertedSalary values with 0
so_survey_df['ConvertedSalary'] = so_survey_df['ConvertedSalary'].fillna(0)
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0
# Set Paid_Job to 1 where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())
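
A more compact way to get the same kind of flag is to cast a boolean comparison to integers. The sketch below is an alternative formulation on invented data, not the method used in the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including a missing salary
toy_df = pd.DataFrame({"ConvertedSalary": [0.0, 45000.0, np.nan, 120000.0]})

# Treat missing salaries as 0, compare against 0, then cast True/False to 1/0
toy_df["Paid_Job"] = (toy_df["ConvertedSalary"].fillna(0) > 0).astype(int)

print(toy_df)
```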

Binning values

For many continuous variables you will care less about the exact value than about the bucket it falls into.


# Bin the continuous variable ConvertedSalary into 5 equal-width bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 5)
# Print the first 5 rows of the equal_binned column
print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

# Bin the ConvertedSalary column using the boundaries in the list bins and label the bins using labels
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)
# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
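
Both binning styles can be compared on a short Series. In this sketch the salary values, boundaries, and labels are all made up for illustration. Note that `pd.cut` uses right-closed intervals by default, so a value equal to a boundary falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries
salaries = pd.Series([0, 25000, 50000, 75000, 100000])

# Equal-width binning: pandas splits the observed range into 4 intervals
equal_width = pd.cut(salaries, 4)

# Boundary-based binning with hypothetical edges and labels
bins = [-np.inf, 10000, 50000, np.inf]
labels = ["Low", "Medium", "High"]
labeled = pd.cut(salaries, bins=bins, labels=labels)

print(equal_width.cat.categories)
print(labeled.tolist())  # ['Low', 'Medium', 'Medium', 'High', 'High']
```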

