Parkinson’s Disease: Exploratory Data Analysis and Prediction with Machine Learning

Parkinson's disease (PD), or simply Parkinson's, is a long-term degenerative disorder of the central nervous system that mainly affects the motor system. The symptoms usually emerge slowly and, as the disease worsens, non-motor symptoms become more common.[1][4] The most obvious early symptoms are tremor, rigidity, slowness of movement, and difficulty with walking,[1] but cognitive and behavioral problems may also occur. Parkinson's disease dementia becomes common in the advanced stages of the disease. Depression and anxiety are also common, occurring in more than a third of people with PD.[2] Other symptoms include sensory, sleep, and emotional problems. The main motor symptoms are collectively called "parkinsonism", or a "parkinsonian syndrome".[1]

In this lecture, we perform an exploratory data analysis and build a machine learning model to predict whether a person has this disease.

1- Dataset

The dataset comes from the Machine Learning Repository website [2], which hosts high-quality machine learning datasets. The data is composed of the following features:

- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
- NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
- status - Health status of the subject: one (1) for Parkinson's, zero (0) for healthy
- RPDE, D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation

Separation of the dataset

For the analysis, we separated the data into two sets.
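The separation step is not shown in code. A minimal sketch, assuming the two sets are the rows with status 1 (Parkinson's) and status 0 (healthy), which would match the illness and healthy names used in the code further down. The small DataFrame is made-up illustrative data, not the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real data come from the repository file.
data = pd.DataFrame({
    "name": ["s01_1", "s01_2", "s02_1", "s03_1"],
    "MDVP:Fo(Hz)": [119.9, 122.4, 197.1, 116.7],
    "status": [1, 1, 0, 1],
})

# Assumed split on the status column: 1 = Parkinson's, 0 = healthy.
# .copy() gives independent DataFrames, so adding columns later is safe.
illness = data[data["status"] == 1].copy()
healthy = data[data["status"] == 0].copy()

print(len(illness), len(healthy))
```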

Analysing Gender.

The gender in the dataset is distributed as follows:

ax = sns.countplot(x='class',data=data, hue = 'gender')

ax.set_title('Frequency of Gender by Class')
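The counts behind such a plot can also be tabulated directly. This is a small sketch using pd.crosstab on made-up values; only the column names 'class' and 'gender' are taken from the plotting call above.

```python
import pandas as pd

# Made-up values for illustration; not the real dataset.
data = pd.DataFrame({
    "class": [1, 1, 1, 0, 0, 0],
    "gender": ["M", "F", "M", "F", "F", "M"],
})

# Rows are classes, columns are genders, cells are the number of records.
counts = pd.crosstab(data["class"], data["gender"])
print(counts)
```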

1- Violin plot of formant frequency

Analysis of Formant Frequencies by gender

We have obtained the following results:

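# .copy() avoids pandas' SettingWithCopyWarning when the 'mean' column is added

frequency_ill = illness[['gender','f1','f2','f3','f4']].copy()

frequency_health = healthy[['gender','f1','f2','f3','f4']].copy()

frequency_ill.head()

# Mean of the four formant frequencies for each row

frequency_ill['mean'] = frequency_ill[['f1','f2','f3','f4']].mean(axis=1)

frequency_health['mean'] = frequency_health[['f1','f2','f3','f4']].mean(axis=1)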

fig, axes = plt.subplots(1,2, sharex = True)

sns.swarmplot(x='gender', y='mean',data=frequency_ill, ax = axes[0])

sns.swarmplot(x='gender', y='mean',data=frequency_health, ax = axes[1])

axes[0].set_title("frequency of Parkinson's positive")

axes[1].set_title("frequency of Parkinson's negative")

plt.tight_layout()

fig, axes = plt.subplots(2,1, sharex = True)

sns.violinplot(x='gender', y='mean',data=frequency_ill, ax = axes[0])

sns.violinplot(x='gender', y='mean',data=frequency_health, ax = axes[1])

axes[0].set_title("frequency of Parkinson's positive")

axes[1].set_title("frequency of Parkinson's negative")

plt.tight_layout()

2- Boxplot

fig, axes = plt.subplots(1,2, figsize=(8,6))

sns.boxplot(x='gender', y='mean',data=frequency_ill, ax = axes[0])

sns.boxplot(x='gender', y='mean',data=frequency_health, ax = axes[1])

plt.tight_layout()

plt.show()

Analysis of intensity parameters

sns.distplot(illness['meanIntensity'])

sns.distplot(healthy['meanIntensity'])

table = list(np.arange(0,100,5))

percentiles_i = []

percentile_h = []

for i in table:
    # i-th percentile of mean intensity in each group, rounded to 2 decimals
    perc = np.round(np.percentile(illness['meanIntensity'], i), 2)
    per = np.round(np.percentile(healthy['meanIntensity'], i), 2)
    percentiles_i.append(perc)
    percentile_h.append(per)

sns.distplot(percentiles_i,hist=False)

sns.distplot(percentile_h,hist=False)

plt.title("Intensity distribution for positive and negative Parkinson's disease")
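The percentile loop above can also be written without an explicit loop, since np.percentile accepts an array of percentile values. The sketch below uses synthetic numbers as a stand-in for the meanIntensity column (the mean of 70 and spread of 5 are arbitrary assumptions) and checks that both forms agree.

```python
import numpy as np

# Synthetic stand-in for illness['meanIntensity']; not the real values.
rng = np.random.default_rng(0)
intensity = rng.normal(loc=70.0, scale=5.0, size=200)

table = np.arange(0, 100, 5)  # percentiles 0, 5, ..., 95

# Loop form, one percentile at a time.
loop_result = [np.round(np.percentile(intensity, q), 2) for q in table]

# Vectorized form: all percentiles in a single call.
vector_result = np.round(np.percentile(intensity, table), 2)

print(np.allclose(loop_result, vector_result))
```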

II- Predicting Parkinson’s disease with machine learning

1- Importing packages

from sklearn.ensemble import RandomForestClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_selection import VarianceThreshold

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score,f1_score

from sklearn.svm import SVC

2- Prediction with logistic regression

The data has been split with train_test_split and standardized.

NB. Metrics used are f1_score and accuracy.
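The split and standardization code is not shown in the text. A minimal sketch of one common way to do it follows, using synthetic data from make_classification; the test size, stratification and random_state values are assumptions, not the settings used for the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; sizes and seeds are arbitrary choices.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```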

logistic = LogisticRegression(class_weight = 'balanced', random_state = 11)

logistic.fit(X_train, y_train)

prediction = logistic.predict(X_test)

accuracy_score(y_test, prediction)

The accuracy score with logistic regression is 0.86.

f1_score(y_test, prediction)

The f1_score with logistic regression is 0.87.
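To make the two metrics concrete, here is a small sketch that computes accuracy and F1 by hand from the confusion counts of a binary problem. The helper name accuracy_and_f1 and the example labels are made up for illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for binary labels from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Made-up labels: 1 = Parkinson's, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(round(acc, 2), round(f1, 2))
```

Because F1 ignores true negatives, it can be higher than accuracy when the positive class dominates.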

3- Prediction with Random Forest Classifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

rf_prediction = rf.predict(X_test)

accuracy_score(y_test, rf_prediction)

The accuracy with random forest is 0.92.

f1_score(y_test, rf_prediction)

The f1_score is 0.95.

4- Prediction with SVM.

sv = SVC(class_weight = 'balanced')

sv.fit(X_train, y_train)

prediction = sv.predict(X_test)

accuracy_score(y_test, prediction)

f1_score(y_test, prediction)

With SVM we obtained an accuracy of 0.86 and an f1_score of 0.92.
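Each score above comes from a single train/test split, so small differences between models may be down to chance. One possible way to compare the three classifiers more robustly is cross-validation, sketched below on synthetic data. The fold count, n_estimators=50, max_iter=1000 and the synthetic dataset are all assumptions and not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; not the Parkinson's dataset.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

# Scaling is placed inside the pipeline so it is refit on each training fold.
models = {
    "logistic": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(class_weight="balanced")),
}

scores = {}
for name, model in models.items():
    # Mean F1 over 4 folds.
    scores[name] = cross_val_score(model, X, y, cv=4, scoring="f1").mean()
    print(name, round(scores[name], 2))
```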

5- Hyperparameter tuning

from scipy.stats import uniform

C = [1.0,1.5,2.0,2.5]

param_grid = dict(C=C)

lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(estimator = lr, param_grid=param_grid,scoring = 'accuracy', cv = 4,n_jobs = -1)

grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)

f1_score(y_test, y_pred)

grid.score(X_test, y_test)

Accuracy: 0.81

F1_score: 0.87

n_estimators = [10,100,200]

max_features = [4,5,8]

params_grid = dict(n_estimators = n_estimators, max_features = max_features)

rfc = RandomForestClassifier()

search = GridSearchCV(estimator = rfc, param_grid=params_grid, cv = 3, scoring = 'accuracy', n_jobs=-1)

search.fit(X_train,y_train)

search.score(X_test, y_test)

pred = search.predict(X_test)

f1_score(y_test, pred)

Accuracy: 0.83

F1_score: 0.88

kernels = ['poly','rbf','sigmoid']

C = [0.1,10,100]

gamma = [1,0.1,0.01,0.001]

param_grid = dict(C=C, kernel = kernels, gamma = gamma)

svm = SVC()

grids = GridSearchCV(estimator = svm, param_grid = param_grid, scoring='accuracy', cv = 3, n_jobs = -1)

grids.fit(X_train,y_train)

prediction = grids.predict(X_test)

grids.score(X_test, y_test)

f1_score(y_test, prediction)

Accuracy: 0.95

F1_score: 0.96

After the grid search, we conclude that the best model is a support vector machine, with an accuracy of 0.95 and an F1_score of 0.96.
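To see which hyperparameter combination a grid search selected, a fitted GridSearchCV exposes best_params_ and best_score_. The sketch below uses synthetic data and a reduced version of the SVM grid (the 'poly' kernel and gamma = 1 are left out to keep the run short); the data and the reduced grid are assumptions, so the printed values will not match the scores reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; not the Parkinson's dataset.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reduced grid, chosen here only to keep the example fast.
param_grid = {
    "C": [0.1, 10, 100],
    "kernel": ["rbf", "sigmoid"],
    "gamma": [0.1, 0.01, 0.001],
}

grids = GridSearchCV(SVC(), param_grid=param_grid, scoring="accuracy", cv=3)
grids.fit(X_train, y_train)

print(grids.best_params_)            # selected combination
print(grids.best_score_)             # mean cross-validated accuracy on the training folds
print(grids.score(X_test, y_test))   # accuracy of the refitted best model on the test set
```

The cross-validated score and the test score answer different questions: the first drives model selection, the second estimates performance on unseen data.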

6- Conclusion.

We have implemented a model that can predict the presence of Parkinson’s disease with an accuracy of 0.95 and an F1_score of 0.96.

Bibliography:

1. Wikipedia, "Parkinson’s disease".

2. Machine Learning Repository, Parkinsons dataset: https://archive.ics.uci.edu/ml/datasets/Parkinsons

 
 
 
