Ntandoyenkosi Matshisela

Predicting Parkinson’s Disease Using Machine Learning Models



Introduction




Given this grim outlook, it is important to build predictive models that can tell a person whether they have the disease and help patients make informed decisions about managing its degenerative course. To do this we will use the Parkinson’s dataset, which is found here


We will also employ 10-fold cross-validation to estimate the accuracy of the models before evaluating them on the test set.


Objectives


The objectives of this analysis are to:


1. Carry out an exploratory data analysis

2. Model the data using base classifiers: logistic regression, the support vector machine (SVM), and k-nearest neighbours (kNN)

3. Model the data using ensemble models


Analysis


We use the following variables:


Matrix column entries (attributes):

name - ASCII subject name and recording number

MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency

MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude

NHR,HNR - Two measures of ratio of noise to tonal components in the voice

status - Health status of the subject: one = Parkinson's, zero = healthy

RPDE,D2 - Two nonlinear dynamical complexity measures

DFA - Signal fractal scaling exponent

spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation



To conduct this analysis we will use the following packages:



# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
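
Before exploring the data we load it into a pandas DataFrame. This is a minimal sketch, assuming the UCI Parkinson's file has been downloaded and saved locally as 'parkinsons.data' (the filename and path are an assumption):

# Load the Parkinson's voice dataset (filename assumed; adjust the path as needed)
park_data = pd.read_csv('parkinsons.data')
print(park_data.shape)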

We quickly look at the size of the data and the distribution of the variables. The columns have widely varying averages, suggesting that the dataset will need to be scaled. The wide gaps between the minimum and the 25th percentile, and between the 75th percentile and the maximum, suggest that there are outliers. The data will therefore need to be scaled and the outliers dealt with.



print(park_data.describe())



We can cement this conclusion with a pair plot, obtained via the seaborn library, using the following command:



sns.pairplot(park_data, hue='status', diag_kind = 'hist')


We can see a high level of correlation between some variables, which can cause multicollinearity issues for some models. The scatter plots also point to the presence of outliers, and the histograms show a degree of skewness in each independent variable. These issues can be a source of problems for the models.
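
To make the correlations easier to read than in the pair plot, a correlation heatmap can be drawn. This is a small illustrative sketch; the styling choices are ours:

# Correlation matrix of the numeric features ('name' is dropped first)
corr = park_data.drop(['name'], axis=1).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation between features')
plt.show()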


Our target variable ‘status’ is imbalanced, so we will not compare the models with the Area Under the Curve (AUC) metric; we will use the accuracy metric instead. As it stands, the individuals without the disease are fewer than those with the disease, as the plot produced by the following code shows:


sns.countplot(x='status', data=park_data)
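
We can also print the raw class counts to see the extent of the imbalance:

# Counts of healthy (0) and Parkinson's (1) subjects
print(park_data['status'].value_counts())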


To help with the scale and skewness issues noted above, we will standardise the data, which puts the variables on a comparable, roughly normal scale.


In selecting the features we drop the target variable (‘status’) and the ‘name’ variable, since the latter carries no useful information for discriminating the target.



X = park_data.drop(['name', 'status'], axis=1)
y = park_data.status 

We also split the data so that 75% forms the train set and the remaining 25% the test set. There are drawbacks to not having a separate validation set, but given the size of the data we have to rely on only two sets; a minimal split along these lines is sketched below.
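
A minimal sketch of this split, assuming a fixed random seed (the same one used for cross-validation below) and stratification on the target, both of which are our own choices:

# 75% train / 25% test split, stratified so both sets keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=263, stratify=y)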


To carry out our final objectives, fitting the base classifiers and the ensemble classifiers, we use the following pipeline code, which follows the approach in https://www.kaggle.com/code/rakesh2711/multiple-models-using-pipeline/notebook, tweaked here and there to suit our needs:



# Build one scaled pipeline per classifier
pipelines = []
pipelines.append(('scaledLR', Pipeline([('scaled', StandardScaler()), ('LR', LogisticRegression())])))
pipelines.append(('scaledSVC', Pipeline([('scaled', StandardScaler()), ('SVC', SVC())])))
pipelines.append(('scaledKNN', Pipeline([('scaled', StandardScaler()), ('kNN', KNeighborsClassifier())])))
pipelines.append(('scaledRF', Pipeline([('scaled', StandardScaler()), ('RF', RandomForestClassifier())])))
pipelines.append(('scaledXGB', Pipeline([('scaled', StandardScaler()), ('XGB', xgb.XGBClassifier())])))

model_name = []
results = []
for pipe, model in pipelines:
    # 10-fold cross-validation accuracy on the training set
    kfold = KFold(n_splits=10, random_state=263, shuffle=True)
    cross_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')

    # Fit on the full training set and score on the held-out test set
    model_scaled = model.fit(X_train, y_train)
    y_pred = model_scaled.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    results.append(cross_results)
    model_name.append(pipe)
    msg = "%s: %f (%f)" % (pipe, cross_results.mean(), cross_results.std())
    acc = "%s: %f" % (pipe, accuracy)

    print(msg)
    print(acc)


# Compare the cross-validation accuracies of the different algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(model_name)
plt.show()



Basically, the code does the following:

1. Creates a list of model pipelines

2. Scales the features in each pipeline using StandardScaler

3. Runs a 10-fold cross-validation experiment

4. Computes the mean and standard deviation of the ten fold accuracies for each model

5. Plots this information for each model

We can tabularise the information as follows.
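
One way to build such a table from the cross-validation results collected in the loop above (the column names here are ours):

# Summarise the 10-fold results per model
cv_summary = pd.DataFrame({
    'model': model_name,
    'cv_mean_accuracy': [r.mean() for r in results],
    'cv_std': [r.std() for r in results]
})
print(cv_summary)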






We can also use the models to predict on the test set and compute each model's accuracy.
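
Beyond plain accuracy, a classification report gives precision, recall, and F1 per class. Here is a sketch for the scaled XGBoost pipeline; any of the pipelines above could be substituted:

# Test-set classification report for one refitted pipeline
xgb_pipe = Pipeline([('scaled', StandardScaler()), ('XGB', xgb.XGBClassifier())])
xgb_pipe.fit(X_train, y_train)
print(classification_report(y_test, xgb_pipe.predict(X_test)))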



We can conclude that XGBoost is the best classifier on both the train and test sets, achieving an accuracy of 94%.

Further improvement can be made by hyperparameter tuning, for example with a grid search as sketched below. The full code for this analysis is found here
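
As a starting point, a grid search over a few XGBoost parameters could look like this; the parameter grid is only an illustrative guess, not a tuned recommendation:

# Hyperparameter tuning sketch for the scaled XGBoost pipeline
param_grid = {
    'XGB__n_estimators': [100, 300],
    'XGB__max_depth': [3, 5, 7],
    'XGB__learning_rate': [0.01, 0.1, 0.3]
}
grid = GridSearchCV(
    Pipeline([('scaled', StandardScaler()), ('XGB', xgb.XGBClassifier())]),
    param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)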

