Predicting Parkinson's From Trembling Voice


Parkinson's disease (PD), or simply Parkinson's, is a long-term degenerative disorder of the central nervous system that mainly affects the motor system. The symptoms usually emerge slowly, and as the disease worsens, non-motor symptoms become more common.

Changes in voice and speech are among the most common of these symptoms. This raises the hope that Parkinson's can be predicted by analysing the voice data of subjects.


There are two groups of data that we can analyze, both available as part of the UCI ML Parkinson's dataset project.

First, here is the key to the meaning of the columns:

name - ASCII subject name and recording number

1. MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

2. MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency

3. MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude

4. NHR, HNR - Two measures of the ratio of noise to tonal components in the voice

5. status - Health status of the subject: (one) - Parkinson's, (zero) - healthy

6. RPDE, D2 - Two nonlinear dynamical complexity measures

7. DFA - Signal fractal scaling exponent

8. spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation


The assumption here is that we can predict Parkinson's disease from alterations in the patient's voice.

Exploratory Data Analysis

The first step after importing the necessary libraries is to explore the data, looking for missing values and working out which pre-processing steps are needed.

 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64  
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64

The listing above shows that there are no missing values in the data: every column has 195 non-null entries. Now we can start to ask some questions.
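The missing-value check can be sketched as follows. This is only an illustration: `sample_df` is a hypothetical frame with made-up values and three of the columns listed above, and one gap is planted on purpose so the check has something to report. In the project itself the same calls would be run on `parkinsons_df`.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, NOT taken from the real dataset.
# One NaN is planted deliberately so the check has something to find.
sample_df = pd.DataFrame({
    "MDVP:Fo(Hz)": [119.9, 122.4, np.nan, 197.1],
    "HNR": [21.0, 19.1, 20.6, 30.9],
    "status": [1, 1, 1, 0],
})

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)

# Columns that would need attention during pre-processing.
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

On the real data, an empty `columns_with_gaps` list would agree with the non-null counts shown in the listing.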

Is there a relation between Gender and probability of Parkinson's in the sample?

# Assumes a 'sex' column is available; it is not in the column listing above.
dem_gen = parkinsons_df['sex'].value_counts() * 100 / len(parkinsons_df)
dem_gen = pd.DataFrame(dem_gen)
dem_gen.plot(kind='bar', figsize=(7, 5))
plt.title("Gender vs Parkinson's")
plt.ylabel('Share of the dataset (%)')

Which feature is the most significant in predicting the disease?

To answer this, we need to explore the correlation between the columns.

The darker the colour, the stronger the correlation between that column and the status of the disease.
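As a rough sketch of how such a ranking could be computed, the snippet below builds a correlation matrix and sorts the features by the absolute value of their correlation with `status`. The frame `demo_df` and its columns `strongly_related`, `weakly_related` and `unrelated` are hypothetical, synthetic data generated only so that the example runs on its own. They are not features of the Parkinson's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
status = rng.integers(0, 2, size=n)

demo_df = pd.DataFrame({
    "status": status,
    # Follows status closely (small noise).
    "strongly_related": status + rng.normal(0, 0.1, size=n),
    # Follows status only loosely (large noise).
    "weakly_related": status + rng.normal(0, 5.0, size=n),
    # Independent of status.
    "unrelated": rng.normal(0, 1.0, size=n),
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = demo_df.corr()

# Keep only the correlations with the target and rank by magnitude,
# since a strong negative correlation is as informative as a positive one.
status_corr = corr_matrix["status"].drop("status")
ranked = status_corr.abs().sort_values(ascending=False)
print(ranked)
```

The same `corr_matrix` could be passed to a plotting function such as seaborn's `heatmap` to obtain a coloured grid. Note that correlation only measures linear association, so a high value suggests a useful feature but does not by itself show cause and effect.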

Other questions can also be answered from the data, for example:

What is the age range of the subjects with the disease compared to the age range of the whole sample (distribution graph)?

Predicting the Data

Here we use the XGBClassifier model and then measure its accuracy on the test set with accuracy_score.

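from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)  # train on the training split
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)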


The resulting accuracy score is high enough for the model to predict the disease status of the test subjects reasonably well.


