Govind Savara

Aug 7, 2020 · 3 min

Parkinson's Disease Exploratory Data Analysis and Prediction

Neurodegenerative diseases are a heterogeneous group of disorders characterized by the progressive degeneration of the structure and function of the nervous system. They are incurable, debilitating conditions that cause problems with movement or with mental functioning (the latter are called dementias). Alzheimer’s disease and Parkinson’s disease are the most common neurodegenerative diseases.

I performed exploratory data analysis (EDA) on the UCI ML Parkinson’s data set and then trained a model to predict Parkinson's Disease from it.

The data set has 195 samples. Each row corresponds to one voice recording of an individual and contains the recording's name plus 23 attributes of biomedical voice measurements. The main aim of the data is to discriminate healthy people from those with Parkinson's Disease, according to the "status" column, which is set to `0` for healthy individuals and `1` for individuals affected by Parkinson's Disease.

Loading the data:

import pandas as pd

parkinsons_data = pd.read_csv('parkinsons.data')
status_value_counts = parkinsons_data['status'].value_counts()

print("Number of Parkinson's Disease patients: {} ({:.2f}%)".format(status_value_counts[1], status_value_counts[1] / parkinsons_data.shape[0] * 100))
print("Number of Healthy patients: {} ({:.2f}%)".format(status_value_counts[0], status_value_counts[0] / parkinsons_data.shape[0] * 100))

Number of Parkinson's Disease patients: 147 (75.38%)
Number of Healthy patients: 48 (24.62%)

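%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

# Pass the column through the x keyword; recent seaborn versions
# no longer accept the data vector as a positional argument.
sns.countplot(x=parkinsons_data['status'])
plt.xlabel("Status value")
plt.ylabel("Number of cases")
plt.show()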
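The class counts also give a useful reference point: since roughly three quarters of the recordings come from patients, a model that always predicts `1` would already be right about 75% of the time. Here is a minimal sketch of that baseline, using only the counts printed earlier (the variable names are illustrative, not part of the original notebook):

```python
# Class counts as printed above: 147 Parkinson's cases, 48 healthy cases.
n_parkinsons = 147
n_healthy = 48
n_total = n_parkinsons + n_healthy

# Accuracy of a trivial model that always predicts the majority class (status = 1).
majority_baseline = max(n_parkinsons, n_healthy) / n_total

print("Majority-class baseline accuracy: {:.2f}%".format(majority_baseline * 100))
```

Any classifier trained on this data should be judged against this figure rather than against 50%.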

Univariate Analysis:

For every input attribute, I drew box plots and distribution plots that compare the healthy cases with the cases affected by Parkinson's Disease.

# Creating a box plot for the attribute 'Average vocal fundamental frequency MDVP:Fo(Hz)'
diseased_freq_avg = parkinsons_data[parkinsons_data["status"] == 1]["MDVP:Fo(Hz)"].values
healthy_freq_avg = parkinsons_data[parkinsons_data["status"] == 0]["MDVP:Fo(Hz)"].values

plt.boxplot([diseased_freq_avg, healthy_freq_avg])
plt.title("Average vocal fundamental frequency MDVP:Fo(Hz) Box plot")
plt.xticks([1, 2], ["Parkinson's Disease Cases", "Healthy Cases"])
plt.show()

# Creating a distribution plot with histograms for the attribute 'Average vocal fundamental frequency MDVP:Fo(Hz)'
# (sns.distplot is deprecated; sns.histplot with kde=True draws the same histogram plus density curve)
plt.figure(figsize=(10,5))
sns.histplot(diseased_freq_avg, kde=True, stat="density", label="Parkinson's Disease Cases")
sns.histplot(healthy_freq_avg, kde=True, stat="density", label="Healthy Cases")
plt.title("Average vocal fundamental frequency MDVP:Fo(Hz) Distribution plot")
plt.legend()
plt.show()
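Alongside the plots, the same healthy-versus-affected comparison can be tabulated for every attribute at once. The sketch below is one possible way to do it; `summarize_by_status` is a hypothetical helper, and the small frame is made-up data that only demonstrates the call. With the real data you would pass `parkinsons_data` instead.

```python
import pandas as pd

def summarize_by_status(data, label_col="status", skip_cols=("name",)):
    """Return mean, median and std of every feature column, split by class label."""
    feature_cols = [c for c in data.columns if c != label_col and c not in skip_cols]
    return data.groupby(label_col)[feature_cols].agg(["mean", "median", "std"])

# Made-up values, NOT taken from the Parkinson's data set.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "MDVP:Fo(Hz)": [120.0, 150.0, 200.0, 220.0],
    "status": [1, 1, 0, 0],
})

summary = summarize_by_status(toy)
print(summary)
```

The result has one row per status value and one (attribute, statistic) column pair per measurement, which makes it easy to spot attributes whose class means differ strongly.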

I also performed exploratory data analysis on all the other attributes, such as the jitter, shimmer, and non-linear measures. Some of the plots are:

Bivariate Analysis:

I also used `sns.pairplot` to examine the pairwise relationships between the attributes. This plot shows how the attributes relate to each other and to the class label, which helps when building a model and selecting the attributes that are really useful for the decision.

# creating pairplot of the attributes
 
sns.pairplot(parkinsons_data, hue="status", diag_kind='kde')
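A pair plot of this many attributes is dense, so a numeric companion can help. One option, sketched below under my own assumptions, is to rank the attributes by their Pearson correlation with the label. `rank_by_correlation` is a hypothetical helper and the data is synthetic; it is not the method used for the results in this post.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(data, label_col="status"):
    """Sort numeric features by absolute Pearson correlation with the label."""
    numeric = data.select_dtypes(include="number")
    corr = numeric.corr()[label_col].drop(label_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in data: one feature that tracks the label, one that is pure noise.
rng = np.random.default_rng(0)
status = np.array([0] * 50 + [1] * 50)
toy = pd.DataFrame({
    "informative": status + rng.normal(0, 0.1, 100),
    "noise": rng.normal(0, 1, 100),
    "status": status,
})

ranking = rank_by_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear relationships, so it complements the pair plot rather than replacing it.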

Parkinson's Disease Prediction using Logistic Regression:

Using "Pipeline", we can combine all the feature engineering, pre-processing, and model steps into a single object. "GridSearchCV" is then used to tune the hyperparameters by cross-validation and pick the best-performing setting.

The UCI ML Parkinson’s data set is clean: every row has a value for every column, so no data cleaning is needed. I use "StandardScaler" to scale the features and a "Logistic Regression" model for the classification.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create data set
y = parkinsons_data["status"]
X = parkinsons_data.drop(["name", "status"], axis=1)

# Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('logreg', LogisticRegression())]

pipeline = Pipeline(steps)

# Create the hyperparameter grid
parameters = {'logreg__C': np.logspace(-2, 8, 15)}

# Creating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=102)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Output:

Accuracy: 0.8983050847457628
              precision    recall  f1-score   support

           0       0.80      0.80      0.80        15
           1       0.93      0.93      0.93        44

    accuracy                           0.90        59
   macro avg       0.87      0.87      0.87        59
weighted avg       0.90      0.90      0.90        59

Tuned Model Parameters: {'logreg__C': 7.196856730011521}
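As a sanity check, the per-class rows of the report can be tied back to the overall accuracy. Recall is the fraction of a class that was predicted correctly, so multiplying it by the support recovers the number of correct predictions. The sketch below uses only the rounded figures from the report, so the counts are derived, not printed by the model:

```python
# Support and recall per class, read off the classification report above.
support = {0: 15, 1: 44}
recall = {0: 0.80, 1: 0.93}

# recall = correct / support, so rounding recall * support to the nearest
# integer recovers the number of correctly classified samples per class.
correct = {label: round(recall[label] * support[label]) for label in support}

accuracy = sum(correct.values()) / sum(support.values())

print(correct)
print("Accuracy: {:.4f}".format(accuracy))
```

This gives 12 of 15 healthy cases and 41 of 44 Parkinson's cases classified correctly, that is 53 of 59, which matches the reported accuracy.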

The tuned model reaches an accuracy of 89.83% on the held-out test set (59 samples) with the parameter `C = 7.196856730011521`.
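The odd-looking value of `C` is simply one of the 15 grid points produced by `np.logspace(-2, 8, 15)`, which are spaced evenly on a log scale between 10^-2 and 10^8. The short sketch below lists the grid and locates the selected value in it; `best_c` is copied from the output above, and the other names are illustrative.

```python
import numpy as np

# Same grid as in the hyperparameter search: 15 points from 1e-2 to 1e8.
c_grid = np.logspace(-2, 8, 15)

# Value reported in the tuned model parameters.
best_c = 7.196856730011521

# Position of the selected value within the grid (0-based).
best_index = int(np.argmin(np.abs(c_grid - best_c)))

print(c_grid)
print("Best C is grid point {} of {}".format(best_index + 1, len(c_grid)))
```

Neighbouring grid points differ by a factor of about 5.2, so the search only establishes that the best `C` lies somewhere between its neighbours (roughly 1.4 and 37). A finer grid in that range might be worth trying, though with 195 samples the difference may not be measurable.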

For more details on this project, see my GitHub repository.
