Machine learning is the field that teaches machines and computers to learn from existing data to make predictions on new data. In this article we will learn how to use Python to perform supervised learning and unsupervised learning, two essential components of machine learning.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings('ignore')
Supervised learning is defined by its use of labeled datasets to train algorithms that classify data or predict outcomes accurately.
Classification is a task in which a machine learning algorithm learns how to assign a class label to examples from the problem domain. An easy-to-understand example is classifying emails as “spam” or “not spam”, or people as “male” or “female”.
We will begin our study with a small dataset designed to give an idea of whether a person's gender can be predicted with an accuracy significantly above 50% based on their personal preferences.
We will fit a k-Nearest Neighbors classifier to the dataset.
df = pd.read_csv("gender_data_set.csv")
print(df.info())
print(df.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Favorite Color        66 non-null     object
 1   Favorite Music Genre  66 non-null     object
 2   Favorite Beverage     66 non-null     object
 3   Favorite Soft Drink   66 non-null     object
 4   Gender                66 non-null     object
dtypes: object(5)
memory usage: 2.7+ KB
None
  Favorite Color Favorite Music Genre Favorite Beverage Favorite Soft Drink Gender
0           Cool                 Rock             Vodka          7UP/Sprite      F
1        Neutral              Hip hop             Vodka     Coca Cola/Pepsi      F
2           Warm                 Rock              Wine     Coca Cola/Pepsi      F
3           Warm     Folk/Traditional           Whiskey               Fanta      F
4           Cool                 Rock             Vodka     Coca Cola/Pepsi      F
Scikit-learn's k-NN classifier cannot work with string-valued features. So we need a function that encodes the strings as numeric values and returns the response and feature arrays.
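To make the encoding step concrete, here is a minimal sketch of what one-hot encoding does to a single categorical column. The toy values are made up for illustration and are not read from gender_data_set.csv; the names `colors`, `encoder_demo`, and `encoded` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, not taken from gender_data_set.csv
colors = np.array([["Cool"], ["Warm"], ["Cool"], ["Neutral"]])

encoder_demo = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array
encoded = encoder_demo.fit_transform(colors).toarray()

print(encoder_demo.categories_)  # the categories found, in sorted order
print(encoded)                   # one column per category, a single 1 per row
```

Each distinct string becomes its own 0/1 column, so a row is described by which category it belongs to rather than by the string itself.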
def encoder(df):
    '''
    Encode categorical features and the response as one-hot numeric arrays.
    Returns the encoded features, the encoded response, and the fitted
    response encoder (needed later to decode the predictions).
    '''
    y = df['Gender']               # response (pandas Series)
    X = df.drop('Gender', axis=1)  # feature columns
    ency = OneHotEncoder(handle_unknown='ignore')
    ency.fit(np.array(y).reshape(-1, 1))
    encX = OneHotEncoder(handle_unknown='ignore')
    encX.fit(X)
    y = ency.transform(np.array(y).reshape(-1, 1)).toarray()
    X = encX.transform(X).toarray()
    return X, y, ency

X, y, ency = encoder(df)

# K-NN classifier with 7 neighbors
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X, y)
y_pred = knn.predict(X)

# Measuring model performance (note: scored on the same data used for fitting)
print(knn.score(X, y), " not a good start ;)")
print(ency.inverse_transform(y_pred)[:5])
0.7121212121212122 not a good start ;)
[['M']
 ['M']
 ['F']
 ['F']
 ['F']]
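The score above is computed on the same rows the classifier was fitted on, which tends to be optimistic. A common alternative is to hold out part of the data with `train_test_split` and score on that part only. The sketch below uses synthetic numeric data as a stand-in (the real dataset is not reproduced here), and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 samples, 5 numeric features, binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Keep 30% of the rows aside; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

knn_demo = KNeighborsClassifier(n_neighbors=7)
knn_demo.fit(X_tr, y_tr)

train_acc = knn_demo.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = knn_demo.score(X_te, y_te)   # accuracy on held-out data
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The held-out accuracy is usually the more honest estimate of how the classifier would behave on new people.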
Regression is best suited to problems that require a continuous outcome. We will learn about fundamental concepts in regression and apply them to predict the maximum temperature given the minimum temperature.
df = pd.read_csv("summary_of_weather.csv", low_memory=False)
df = df[["MaxTemp", "MinTemp"]]
print(df.head())
     MaxTemp    MinTemp
0  25.555556  22.222222
1  28.888889  21.666667
2  26.111111  22.222222
3  26.666667  22.222222
4  26.666667  21.666667
y = df['MaxTemp'].values.reshape(-1, 1)  # response: maximum temperature
X = df['MinTemp'].values.reshape(-1, 1)  # single feature: minimum temperature
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)
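For a scikit-learn regressor, `score` returns the coefficient of determination, R squared: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on synthetic data; the slope, intercept, and noise level are arbitrary assumptions and are not estimates from the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a (MinTemp, MaxTemp)-like relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(-10, 30, size=(100, 1))
y_demo = 10.0 + 0.9 * X_demo[:, 0] + rng.normal(scale=3.0, size=100)

reg_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = reg_demo.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, reg_demo.score(X_demo, y_demo))
```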
Cross-validation is a technique used to reduce the dependence of the score on the way the data is split. We begin by splitting the dataset into k groups, or folds. We hold out the first fold as a test set, fit our model on the remaining folds, predict on the test set, and compute the metric of interest (here R squared). Next, we hold out the second fold as our test set and repeat the same procedure. We continue until every fold has served as the test set. As a result we get k values of R squared, from which we can compute statistics of interest, such as the mean...
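The procedure just described can be written out as an explicit loop. This is a minimal sketch on synthetic data, assuming 5 consecutive, unshuffled folds; for a regressor, that is also what `cross_val_score` does when given `cv=5`, so the two results are compared at the end. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with one feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(-10, 30, size=(200, 1))
y_demo = 10.0 + 0.9 * X_demo[:, 0] + rng.normal(scale=3.0, size=200)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows
fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])  # fit on the other folds
    # R squared on the held-out fold
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

reference = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print(fold_scores)
print(np.mean(fold_scores))
print(np.allclose(fold_scores, reference))
```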
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
print(np.mean(cv_results))
[0.61699036 0.81851337 0.82021727 0.65761298 0.25587764]
0.6338423249654739
Learning about linear regression and how to use it in scikit-learn is an essential first step toward using regularized linear models.
Ridge regression belongs to a class of regression tools that use L2 regularization. The other common type, L1 regularization, limits the size of the coefficients by adding a penalty proportional to the sum of their absolute values. This can shrink some coefficients exactly to zero, which yields sparse models. L2 regularization adds a penalty proportional to the sum of the squared coefficients. It shrinks the coefficients toward zero but does not set them exactly to zero, so, unlike L1 regularization, it does not produce sparse models.
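The difference in sparsity can be seen on a small synthetic problem. In the sketch below, only the first two of ten features actually influence the response; scikit-learn's `Lasso` is used as the L1-penalized model (it is not used elsewhere in this article), and the value `alpha=0.5` is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only the first two carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)  # L1 penalty
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)  # L2 penalty

print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Exact zeros (Lasso):", int(np.sum(lasso_demo.coef_ == 0)))
print("Exact zeros (Ridge):", int(np.sum(ridge_demo.coef_ == 0)))
```

The L1 model drops uninformative features by setting their coefficients to exactly zero, while the L2 model keeps small nonzero values for all of them.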
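X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Ridge's `normalize` argument was removed in scikit-learn 1.2, so the feature
# is standardized in a pipeline instead (not numerically identical to normalize=True).
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)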
Unsupervised learning finds patterns in data but without a specific prediction task in mind.
We will introduce k-means clustering, which finds clusters of samples. Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups.
The k-means algorithm chooses k centroids and assigns every data point to its nearest centroid, while keeping the distances between points and their centroid (the inertia) as small as possible. The ‘means’ in k-means refers to averaging the data points of a cluster; that average is the cluster's centroid.
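The assign-then-average idea can be sketched in a few lines of NumPy. This is a simplified illustration (random initialization, fixed number of iterations), not scikit-learn's `KMeans` implementation; the function `kmeans_sketch` and the two synthetic blobs are hypothetical.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Hypothetical helper: plain assign/average iterations, not sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct samples picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    # Final assignment and inertia (sum of squared distances to the assigned centroid)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(np.sum(dists[np.arange(len(points)), labels] ** 2))
    return labels, centroids, inertia


# Two synthetic, well-separated blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
points_demo = np.vstack([blob_a, blob_b])

labels_demo, centroids_demo, inertia_demo = kmeans_sketch(points_demo, k=2)
print(centroids_demo)
print(inertia_demo)
```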
df = pd.read_csv('segmented_customers.csv')
df.head()
2D Clustering based on Age and Spending Score
plt.scatter(x='Age', y='Spending Score', data=df)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Scatter plot of Age v/s Spending Score')
plt.show()
Deciding K value
samples = df[['Age', 'Spending Score']].values
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    model.fit(samples)
    # Inertia: sum of squared distances of samples to their closest centroid
    inertias.append(model.inertia_)
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
Applying KMeans for k=4
model = KMeans(n_clusters=4)
labels = model.fit_predict(samples)
centroids = model.cluster_centers_
# color each sample by its cluster label
plt.scatter(x='Age', y='Spending Score', data=df, c=labels)
# plot the centroids
plt.scatter(x=centroids[:, 0], y=centroids[:, 1], c='red', alpha=0.5, s=200)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Scatter plot of Age v/s Spending Score')
plt.show()
Principal Component Analysis
Dimension reduction summarizes a dataset using its commonly occurring patterns. In this section, you'll learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning.
df = pd.read_csv('wine.csv')[['total_phenols', 'od280']]
samples = df.values
print(df.corr())
print("od280 and total_phenols are positively correlated")
plt.scatter(x=samples[:, 0], y=samples[:, 1])
plt.show()

# Fit PCA and express the samples in terms of the principal components
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
transformed_df = pd.DataFrame(transformed)
print(transformed_df.corr())
print("the PCA features are no longer correlated")
plt.scatter(x=transformed[:, 0], y=transformed[:, 1])
plt.show()
PCA rotates the samples so that the principal components are aligned with the coordinate axes.
[[-0.64116665 -0.76740167]
 [-0.76740167  0.64116665]]
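A fitted scikit-learn `PCA` object exposes its directions through the `components_` attribute, one row per principal component. The sketch below, on synthetic correlated data rather than the wine measurements, checks two properties that follow from the rotation view: the rows of `components_` are orthonormal, and the transformed features are uncorrelated. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic pair of positively correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=300)
samples_demo = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=300)])

pca_demo = PCA().fit(samples_demo)
components = pca_demo.components_  # shape (2, 2), one principal direction per row
transformed_demo = pca_demo.transform(samples_demo)

print(components)
print(components @ components.T)                     # close to the identity matrix
print(np.corrcoef(transformed_demo, rowvar=False))   # off-diagonal close to 0
```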
Intrinsic dimension of the fish data
scaler = StandardScaler()
df = pd.read_csv('fish.csv').drop('A', axis=1)
samples = np.array(df)
pca = PCA()
# Standardize the measurements, then fit PCA on the scaled values
pipeline = make_pipeline(scaler, pca)
pipeline.fit(samples)
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()
We have 2 PCA features with significant variance, so the reasonable choice for the intrinsic dimension of the fish measurements would be 2.
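Reading the bar chart is a judgment call. One common complementary heuristic is to keep the smallest number of components whose cumulative explained variance ratio passes a threshold. The sketch below applies it to synthetic data built from two latent factors; the 95% threshold and all `*_demo` names are assumptions for illustration, not values derived from the fish measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features driven by only 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0, 0.0]])
samples_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 5))

pca_demo = PCA()
# The pipeline fits pca_demo in place on the standardized samples
make_pipeline(StandardScaler(), pca_demo).fit(samples_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
# Smallest number of components reaching 95% of the variance (assumed threshold)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 4))
print(n_keep)
```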
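pca = PCA(n_components=2)
# Standardize first, as was done when the variances were plotted
pipeline = make_pipeline(StandardScaler(), pca)
pipeline.fit(samples)
transformed = pipeline.transform(samples)
print(transformed.shape)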