# Machine Learning Preprocessing: Scaling

**Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. It is often a necessary step because many models assume that the training data is normally distributed, and if it is not, you risk biasing your model.**

Examples of standardization methods covered here:

Log Normalization

Scaling

**When to standardize?**

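Standardization is useful when your data has varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis. It also helps distance-based models such as k-nearest neighbors, where a feature with a large range would otherwise dominate the distance calculation.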

A typical case: the dataset's features are continuous and on different scales.

Suppose a dataset contains height and weight features. In order to compare these features, they must be on the same scale.
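To make the height and weight idea concrete, here is a minimal sketch of standardizing two features by hand. The measurements and the `standardize` helper are made up for illustration; they are not part of the wine dataset used below.

```python
from statistics import mean, pstdev

# Hypothetical measurements, made up for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [50.0, 60.0, 65.0, 80.0, 95.0]


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


heights_z = standardize(heights_cm)
weights_z = standardize(weights_kg)

# Both features are now unitless, with mean 0 and standard deviation 1.
print([round(z, 2) for z in heights_z])
print([round(z, 2) for z in weights_z])
```

After this step a value of 1.0 means "one standard deviation above average" for either feature, so the two can be compared directly.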

```
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = pd.read_csv("/content/wine_types.csv")
df.head()
```

```
df.info()
```

```
df.describe()
```

**Proline** has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization comes in handy.

```
# Create the features and target
X = df.drop("Type", axis=1)
y = df["Type"]
```

**Unscaled KNeighborsClassifier**

```
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Model
knn = KNeighborsClassifier()
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print(knn.score(X_test, y_test))
```

`0.71`

**Log Normalization**

Log normalization is a method for standardizing your data that can be useful when you have a particular column with high variance. It replaces each value with its natural logarithm, which compresses large values far more than small ones. As shown above, training a k-nearest neighbors classifier on the unscaled wine dataset did not reach a very high accuracy score.

```
# Print out the variance of the Proline column
print(df['Proline'].var())
# 99166.71735542436

# Apply the log normalization function to the Proline column
df['Proline_log'] = np.log(df['Proline'])

# Check the variance of the normalized Proline column
print(df['Proline_log'].var())
# 0.17231366191842012
```

Look at the change in variance: from roughly 99,167 down to roughly 0.17!
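The same effect can be reproduced without the CSV file. The sketch below uses synthetic, right-skewed values generated with NumPy (an assumption for illustration, not the Proline column) and compares the variance before and after taking the log.

```python
import numpy as np

# Synthetic, right-skewed positive values (illustrative only; not the wine data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=6.0, sigma=0.5, size=1000)

# Variance on the original scale versus the log scale.
var_before = values.var()
var_after = np.log(values).var()

print(var_before, var_after)
```

Keep in mind that `np.log` is only defined for strictly positive values. If a column contains zeros, `np.log1p` (which computes log(1 + x)) is a common alternative.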

**Scaling**

**What is Feature Scaling?**
Scaling is a method of standardization that's most useful when you're working with a dataset that contains continuous features on different scales.
Feature scaling (here, standard scaling) transforms each feature so that it has a mean of 0 and a variance of 1. This shifts and rescales the values but does not change the shape of their distribution.
Here we only show **StandardScaler**.
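As a quick sanity check of what standard scaling does, the sketch below scales a tiny made-up array (`X_demo` is hypothetical, not the wine data) and compares the result with the by-hand formula, (x - mean) / std, applied per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 5000.0],
                   [4.0, 7000.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same computation by hand: subtract the column mean,
# then divide by the column standard deviation (population std, ddof=0).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled.mean(axis=0))  # per-column means after scaling
print(X_demo_scaled.std(axis=0))   # per-column standard deviations after scaling
print(np.allclose(X_demo_scaled, X_manual))
```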

**Scaled KNeighborsClassifier**

```
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler
# Create the scaling method.
ss = StandardScaler()
# Apply the scaling method to the dataset used for modeling.
# In scikit-learn, fit_transform both fits the method to the data
# and transforms the data in a single step.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)
# Score the model on the test data.
print(knn.score(X_test, y_test))
```

`0.9555555555555556`

**Unscaled accuracy is 0.71**

**Scaled accuracy is about 0.96**

**A huge difference!!!**
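One caveat about the comparison above: the two scores come from different random splits, and the scaler was fit on the full dataset before splitting, so the test rows influenced the scaling. A possible way to tighten this up is sketched below. It assumes scikit-learn's bundled `load_wine` data as a stand-in for `wine_types.csv` (the columns may not match the CSV exactly), uses one fixed split for both models, and wraps the scaler in a pipeline so it is fit on the training rows only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for wine_types.csv.
X_wine, y_wine = load_wine(return_X_y=True)

# One fixed, stratified split so both models see exactly the same rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, random_state=42, stratify=y_wine
)

# Baseline: k-nearest neighbors on the raw features.
unscaled_knn = KNeighborsClassifier().fit(X_tr, y_tr)
unscaled_acc = unscaled_knn.score(X_te, y_te)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation to the test rows when scoring.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)
scaled_acc = scaled_knn.score(X_te, y_te)

print(f"unscaled: {unscaled_acc:.2f}, scaled: {scaled_acc:.2f}")
```

The exact numbers depend on the split, so treat them as indicative rather than as a reproduction of the scores reported above.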

Dataset: Wine Types
