# Machine Learning Preprocessing: Scaling

Standardization is a preprocessing method used to transform continuous data so that it looks normally distributed. It is often a necessary step, because many models assume the training data is normally distributed; if it isn't, you risk biasing the model. Two common techniques are:

- Log normalization

- Scaling

## When to Use Standardization?

Standardization is useful when your data has features on varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.

It is also useful when the dataset's features are continuous and on different scales. Suppose a dataset contains height and weight features. In order to compare these features, they must be on the same scale.
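The height/weight idea can be sketched in a few lines. This is a minimal example with made-up values, showing the standardization formula z = (x - mean) / std putting both features on the same scale:

```
# Hypothetical height/weight values for illustration
import numpy as np

height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

def standardize(x):
    # Subtract the mean and divide by the standard deviation
    return (x - x.mean()) / x.std()

height_z = standardize(height_cm)
weight_z = standardize(weight_kg)

# After standardization both features have mean 0 and variance 1,
# so they can be compared directly
print(height_z.mean(), height_z.var())
print(weight_z.mean(), weight_z.var())
```

After this transform, a height of 190 cm and a weight of 90 kg both map to the same z-score, so neither feature dominates just because of its units.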

```
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

```
# Load the dataset
df = pd.read_csv("/content/wine_types.csv")
df.head()
```

`df.info()`

`df.describe()`

Proline has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization comes in handy.

```
# create feature and target
X = df.drop("Type", axis = 1)
y = df['Type']
```

## Unscaled KNeighborsClassifier

```
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create the model
knn = KNeighborsClassifier()

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))
# 0.71
```

## Log Normalization

Log normalization is a method for standardizing your data that can be useful when you have a particular column with high variance. As you saw in the previous section's exercise, training a k-nearest neighbors classifier on that subset of the wine dataset didn't get a very high accuracy score.

```
# Print out the variance of the Proline column
print(df['Proline'].var())
# 99166.71735542436

# Apply the log normalization function to the Proline column
df['Proline_log'] = np.log(df['Proline'])

# Check the variance of the normalized Proline column
print(df['Proline_log'].var())
# 0.17231366191842012
```

Notice how dramatically the variance drops!
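The same effect is easy to reproduce on synthetic data. This sketch (using made-up log-normally distributed values, not the wine dataset) shows how `np.log` compresses large values and shrinks the variance:

```
import numpy as np

# Synthetic, strictly positive, right-skewed data
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=6.0, sigma=0.5, size=1000)

# Variance on the raw scale is large; after np.log it collapses
# to roughly sigma**2 = 0.25
print(skewed.var())
print(np.log(skewed).var())
```

One caveat worth knowing: `np.log` requires strictly positive values, which Proline satisfies; for data that contains zeros, `np.log1p` is a common alternative.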

## Scaling

What is feature scaling? Scaling is a standardization method that is most useful when you're working with a dataset containing continuous features on different scales. It transforms each feature to have mean 0 and variance 1, which brings approximately normal features close to a standard normal distribution. Here we demonstrate only StandardScaler.

## Scaled KNeighborsClassifier

```
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Split first, so the scaler is fit only on the training data
# (fitting it on the full dataset would leak test-set information)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create the scaling method
ss = StandardScaler()

# In scikit-learn, fit_transform both fits the method to the data
# and transforms the data in a single step
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

# Fit the k-nearest neighbors model to the scaled training data
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

# Score the model on the scaled test data
print(knn.score(X_test_scaled, y_test))
```

`0.9555555555555556`

Unscaled accuracy is 0.71

Scaled accuracy is 0.95

A huge difference!!!
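A single train/test split can be noisy, so a cross-validated comparison is more convincing. This sketch uses scikit-learn's built-in wine dataset as a stand-in for `wine_types.csv` (which may not be available), and wraps the scaler in a `Pipeline` so it is re-fit on each training fold, avoiding leakage:

```
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in wine data as a stand-in for wine_types.csv
X, y = load_wine(return_X_y=True)

unscaled = KNeighborsClassifier()
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 5-fold cross-validated accuracy, without and with scaling
print(cross_val_score(unscaled, X, y, cv=5).mean())
print(cross_val_score(scaled, X, y, cv=5).mean())
```

Putting the scaler inside the pipeline is the idiomatic way to combine preprocessing with cross-validation in scikit-learn.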

Dataset: Wine Types
