International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(4): 482-499
Published online December 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.4.482
© The Korean Institute of Intelligent Systems
Wiharto Wiharto, Ahmad Sucipto, and Umi Salamah
Department of Informatics, Universitas Sebelas Maret, Surakarta, Indonesia
Correspondence to: Wiharto Wiharto (wiharto@staff.uns.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Parkinson’s disease is a neurological disorder that interferes with human activities. Early detection is needed to facilitate treatment before the symptoms worsen. Earlier detection approaches used vocal recordings of patients compared with those of normal subjects. However, detection based on vocal recordings still has weaknesses: the voice contains a great deal of information that is not necessarily relevant to a detection system. Previous studies proposed feature selection methods for the detection system, but those methods cannot handle variation in the amount of data, including imbalances between the numbers of samples, features, and classes. To address these problems, the Hilbert-Schmidt independence criterion Lasso (HSIC Lasso) feature selection method is used; its feature transformation capability can produce more relevant features. In addition, the detection system uses the synthetic minority oversampling technique (SMOTE) to balance the data and several classification methods, namely k-nearest neighbors, support vector machine, and multilayer perceptron, to obtain the best predictive model. HSIC Lasso selects 18 of 45 features with an accuracy of 88.34% on a small sample and 50 of 754 features with an accuracy of 96.18% on a large sample. Compared with previous studies, these results indicate that HSIC Lasso is more suitable for balanced data with more samples and features.
Keywords: Parkinson’s disease, Early detection, Vocal voice, HSIC Lasso
Parkinson’s disease is a chronic and progressive neurodegenerative disorder characterized by motor and non-motor dysfunction of the human nervous system [1]. This neurodegenerative disorder occurs in dopamine-producing cells in the substantia nigra in the midbrain [2]. When dopamine-producing cells, also known as dopaminergic cells, degenerate, dopamine levels drop. Dopamine is a neurotransmitter that binds to G protein receptors in the dopaminergic signaling system, which is essential for physiological processes and the balance of activities in the human body [3]. The motor dysfunction of Parkinson’s disease is characterized by muscle weakness and stiffness, slurred speech, and weight loss, while the non-motor dysfunction includes cognitive changes, sleep disturbances, and psychiatric symptoms [4]. If these symptoms worsen, they affect all of a person’s activities and become difficult to treat. Researchers therefore continue to pursue early detection of Parkinson’s disease so that clinicians can treat patients more easily.
Previous detection of Parkinson’s disease has made extensive use of patients’ vocal recordings compared with those of normal individuals. Sakar et al. [5] reported that 90% of people with Parkinson’s disease have problems with their voices, and they tested the pronunciation of letters, numbers, and words by people with Parkinson’s disease and by normal individuals. The results show that the pronunciation of vowels provides the most discriminatory information in cases of Parkinson’s disease, because the voice noise characteristic of the disease is heard more clearly in sustained vowel sounds [6]. However, the use of human vocal sounds still has weaknesses for a detection system.
Human vocal sound carries a lot of information that is not necessarily useful for the detection process. Little et al. [7] extracted voice signals based on frequency, duration, and noise, while Sakar et al. [8] transformed voice signals with the wavelet transform and extracted features based on frequency and noise. Both used a machine learning approach, because it can shorten the time needed to analyze and observe the development of a disease [9]. Moreover, machine learning makes it easier to process large amounts of data automatically, including attributes of vocal sounds [10]. The extracted sound attributes are used as data for the learning process of a detection system with a machine learning approach. The detection systems in both studies show that not all attributes make a significant contribution to the learning process. A method is therefore needed to choose the attributes of vocal sound extraction that are useful for the detection system.
The process of selecting attributes, or feature selection, is very useful for improving the performance of a detection system. In the same case, Sakar et al. [8] tested the minimum redundancy maximum relevance (MRMR) feature selection method, which uses a Parzen window estimator to calculate feature correlations in the data. However, the Parzen window estimator is not reliable when the number of training samples is small and then yields less relevant features. Gunduz [11] used the ReliefF method to filter features based on feature scores, but ReliefF performs poorly on data with unbalanced classes and likewise produces features that are less relevant to the detection system. The feature selection methods applied in previous studies cannot cope with variations in the amount of data, including the ratio of the number of samples to the number of features and the imbalance between samples of Parkinson’s disease and normal subjects. Therefore, a feature selection method is needed that can produce features that remain relevant under such variations.
Yamada et al. [12] presented the Hilbert-Schmidt independence criterion Lasso (HSIC Lasso) feature selection method, which transforms features into kernel form to obtain more relevant features. The resulting kernel representation is sparse, which improves the calculation of feature correlation values. Damodaran et al. [13] used HSIC Lasso to select high-dimensional image features and achieved good prediction performance. Therefore, this study tests the HSIC Lasso feature selection method in a Parkinson’s disease detection system, with the expectation that it will produce vocal sound features that remain relevant under variations in the amount of data.
This study is related to other research through the methods and materials used. These links include applying the same feature selection, oversampling, and classification methods, or applying them to the same case. The datasets are also taken from studies with relatively similar cases and forms of feature extraction, namely Little et al. [7], Naranjo et al. [14], and Sakar et al. [8]. Little et al. [7] introduced a new technique for analyzing voices in Parkinson’s disease based on recurrence and fractal scaling; the results show that recurrence and fractal scaling measures are effective at separating healthy voices from disordered ones. Naranjo et al. [14] tested replicated voice recordings that differentiate healthy people from people with Parkinson’s disease using a subject-based Bayesian approach; the proposed system is able to distinguish the cases even with a small sample. Sakar et al. [8] examined tunable Q-factor wavelet transform feature extraction from voice signals in cases of Parkinson’s disease; the proposed method performs better than previously applied techniques.
The feature selection method used in this study is taken from Yamada et al. [12], who derived an optimal solution for determining kernel-based relevant features with HSIC Lasso; the method proved effective in classification and regression with thousands of features. Meanwhile, the oversampling and classification methods used in this study are related to research that applies feature selection to the same case and datasets. The synthetic minority oversampling technique (SMOTE) was also applied by Hoq et al. [15], who applied artificial intelligence to Parkinson’s disease by extracting vocal sound features with principal component analysis and a sparse autoencoder combined with synthetic minority oversampling; the sparse autoencoder gave better detection performance. For classification, Ouhmida et al. [16] used the k-nearest neighbors (k-NN) method to analyze the performance of machine learning in identifying Parkinson’s disease from speech disorders, together with MRMR and ReliefF feature selection; ReliefF gave the better performance. Abdullah et al. [17] applied machine learning to a Parkinson’s disease diagnosis system based on voice recordings with Lasso feature selection and support vector machine (SVM) classification, and their model outperformed previous methods. Ramchoun et al. [18] introduced a new approach to optimizing the network architecture of the multilayer perceptron classifier, and the resulting architecture showed high effectiveness compared with previously applied models.
Feature selection research using HSIC Lasso for the detection of Parkinson’s disease is carried out by building a detection system with a machine learning approach. The system is built in several stages: feature selection, data balancing with oversampling, and classification to build predictive models; the stages are shown in Figure 1. To obtain the optimal prediction model, a hyper-parameter tuning stage is applied to the classification model. The data are divided into two parts, training data and testing data: the prediction model is fitted on the training data, while the testing data are used to evaluate it. The data are split according to the evaluation technique used in the related research, so that the results of this study can be compared with those studies: a 75%/25% training/testing split for the Naranjo et al. [14] dataset, 10-fold cross-validation for the Little et al. [7] dataset, and leave-one-subject-out cross-validation for the Sakar et al. [8] dataset.
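As a concrete illustration, the three evaluation schemes above can be set up with scikit-learn; the toy data, subject grouping, and random seeds below are placeholders rather than the actual experimental configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 10)), rng.integers(0, 2, size=120)   # toy data
subject_id = np.repeat(np.arange(40), 3)                          # 3 recordings per subject

# Naranjo et al. [14]: 75% training / 25% testing hold-out split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Little et al. [7]: 10-fold cross-validation
cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = list(cv10.split(X, y))

# Sakar et al. [8]: leave-one-subject-out cross-validation
losocv = LeaveOneGroupOut()
subject_folds = list(losocv.split(X, y, groups=subject_id))
```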
The datasets used in this study were taken from the UCI (University of California, Irvine) machine learning repository. Data were collected by selecting the same case (Parkinson’s disease), relatively similar features (extracted from the voices of people with Parkinson’s disease), and the same sampling method. The datasets that meet these prerequisites are shown in Table 1.
Data acquisition is classified as a physical test: the vowel “a” is recorded three times per subject, except in the Little et al. [7] dataset, which took six samples per subject. The vowel “a” is considered representative of all vowel pronunciations [7]. Voices were captured with a microphone at a sampling frequency of 44.1 kHz. The extracted sound features are grouped into the feature subsets described in Table 2.
Hoq et al. [15] used a method called SMOTE to solve the problem of unbalanced data in machine learning. SMOTE oversamples the class that is in the minority: each new sample is a synthetic point generated between a minority-class sample and one of its nearest minority-class neighbors, with the neighbor chosen at random. In Figure 2, the synthetic data (red triangles) generated from the minor class (blue triangles) make the two classes balanced.
The datasets of Sakar et al. [8] and Little et al. [7] have roughly a 3:1 ratio of people with Parkinson’s disease to normal individuals, so these datasets are unbalanced. This can cause the predictive model to over-fit (perform too well during training) on one class and ignore the minor class. Therefore, the authors test oversampling with SMOTE to add synthetic data to the unbalanced datasets; the method selects adjacent samples from the minor class and balances the class ratio to 1:1. Oversampling is carried out only on the training data, because the synthetic data are not relevant for evaluating the model [15].
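A minimal sketch of this balancing step, assuming the imbalanced-learn package; the toy data mimic the roughly 3:1 imbalance of the Little et al. [7] dataset, and SMOTE is fit only on the training split so that no synthetic sample reaches the evaluation.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(195, 22))                       # toy stand-in for the Little et al. data
y = np.array([1] * 147 + [0] * 48)                   # ~3:1 class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the minority (normal) class on the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print(Counter(y_tr), "->", Counter(y_bal))           # minority class raised to a 1:1 ratio
```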
The least absolute shrinkage and selection operator (Lasso) is an operator used for selection and for shrinkage toward the smallest absolute values [19]. Lasso optimizes the objective by driving the regression coefficients of irrelevant and/or redundant features (features with a low correlation with the error function) to exactly or nearly 0. The Lasso optimization problem is

$$\min_{\boldsymbol{\beta}} \; \frac{1}{2}\left\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \right\|_2^2 + \lambda \left\| \boldsymbol{\beta} \right\|_1,$$

where $\mathbf{X}$ is the input matrix, $\mathbf{y}$ is the output vector, $\boldsymbol{\beta}$ is the vector of regression coefficients, $\|\cdot\|_1$ is the $\ell_1$-norm, and $\lambda \ge 0$ is the regularization parameter that controls the amount of shrinkage.
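To illustrate the shrinkage behavior, the short example below fits scikit-learn’s Lasso to toy data in which only two features are informative; the data and the regularization value are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)   # only features 0 and 1 matter

# scikit-learn scales the squared-error term by 1/(2n), so alpha plays the role of lambda
# up to that factor.
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 3))   # coefficients of the irrelevant features shrink to zero
```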
Yamada et al. [12] proposed the HSIC Lasso method, an alternative implementation of the nonlinear feature-wise Lasso that obtains sparsity over features by forming Hilbert-Schmidt reproducing kernels. The HSIC Lasso optimization problem is

$$\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\left\| \bar{\mathbf{L}} - \sum_{k=1}^{d} \alpha_k \bar{\mathbf{K}}^{(k)} \right\|_F^2 + \lambda \left\| \boldsymbol{\alpha} \right\|_1, \quad \text{subject to } \alpha_1, \ldots, \alpha_d \ge 0,$$

where $\|\cdot\|_F$ is the Frobenius norm, $\bar{\mathbf{K}}^{(k)} = \boldsymbol{\Gamma}\mathbf{K}^{(k)}\boldsymbol{\Gamma}$ and $\bar{\mathbf{L}} = \boldsymbol{\Gamma}\mathbf{L}\boldsymbol{\Gamma}$ are the centered Gram matrices of the $k$-th input feature and the output, $\boldsymbol{\Gamma} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^{\top}$ is the centering matrix, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)^{\top}$ is the vector of regression coefficients, and $\lambda \ge 0$ is the regularization parameter.
In Figure 4, the input and target data are normalized using a z-score (centering by the mean and scaling by the standard deviation) to facilitate the learning process. HSIC Lasso then uses a Gaussian kernel for the input,

$$K(x_i, x_j) = \exp\!\left( -\frac{(x_i - x_j)^2}{2\sigma_x^2} \right),$$

where $x_i$ and $x_j$ are normalized values of an input feature and $\sigma_x$ is the Gaussian kernel width.
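For concreteness, the sketch below implements the core of this formulation from scratch: each feature and the class label are turned into normalized, centered Gram matrices (Gaussian kernel for the inputs, delta kernel for the label), and a non-negative Lasso over their vectorized forms yields one selection weight per feature. The function names, kernel-width heuristic, regularization value, and toy data are illustrative choices, not the authors’ implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def centered_gram(v, sigma=None):
    """Normalized, centered Gaussian Gram matrix of one feature vector of shape (n,)."""
    v = v.reshape(-1, 1)
    d2 = (v - v.T) ** 2
    if sigma is None:                                  # median heuristic for the kernel width
        pos = d2[d2 > 0]
        sigma = np.sqrt(np.median(pos)) if pos.size else 1.0
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    Kc = H @ K @ H
    return Kc / (np.linalg.norm(Kc) + 1e-12)           # Frobenius normalization

def hsic_lasso(X, y, lam=0.02):
    """One non-negative weight per feature; the largest weights mark the selected features."""
    n, d = X.shape
    # Delta kernel for the class label, centered and normalized like the inputs.
    L = (y.reshape(-1, 1) == y.reshape(1, -1)).astype(float)
    H = np.eye(n) - np.ones((n, n)) / n
    Lc = H @ L @ H
    Lc /= (np.linalg.norm(Lc) + 1e-12)
    # Each column of Phi is the vectorized centered Gram matrix of one feature.
    Phi = np.column_stack([centered_gram(X[:, k]).ravel() for k in range(d)])
    # sklearn's Lasso scales the squared loss by 1/(2*N), so rescale lam accordingly;
    # positive=True enforces the non-negativity constraint of HSIC Lasso.
    model = Lasso(alpha=lam / Phi.shape[0], positive=True, max_iter=50000)
    model.fit(Phi, Lc.ravel())
    return model.coef_

# Toy usage: only feature 3 carries the class information.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = (X[:, 3] + 0.1 * rng.normal(size=60) > 0).astype(int)
weights = hsic_lasso(X, y)
print(np.argsort(weights)[::-1][:3])                   # indices with the largest weights
```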
Least-angle regression (LARS) is a stepwise procedure that simplifies the computation of all types of Lasso problems [20]. It estimates the Lasso regression coefficients in stages, at each step advancing the features most strongly correlated with the output, so that in practice the regularization parameter of the HSIC Lasso method does not need to be tuned manually.
In each iteration, LARS updates an active set of feature indices $A \subseteq \{1, 2, \ldots, d\}$ containing the features most correlated with the current residual, and moves the regression coefficients of the active features jointly along the equiangular direction until another feature reaches the same correlation and enters $A$. When the procedure ends, the features are selected based on the resulting regression coefficients: features with nonzero coefficients are retained, and the coefficient values serve as the feature selection weights.
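The snippet below illustrates this idea with scikit-learn’s `lars_path`, which traces the Lasso path and reports which features become active; it is a generic illustration on toy data, not the exact routine used for HSIC Lasso in the paper.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 2] + 2.0 * X[:, 7] + 0.05 * rng.normal(size=100)   # two informative features

# method="lasso" makes LARS follow the Lasso regularization path;
# positive=True mirrors the non-negativity constraint used by HSIC Lasso.
alphas, active, coefs = lars_path(X, y, method="lasso", positive=True)

print("active features at the end of the path:", active)
print("number of breakpoints on the path:", coefs.shape[1])
```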
In this study, feature selection is performed with the HSIC Lasso method. Its main setting is the number of features to retain, that is, how many nonzero regression coefficients are kept for each dataset.
The processed data are used to train the prediction models for detecting Parkinson’s disease. In machine learning, this learning stage takes place in the classification process. This study uses the classification methods described below.
The basic concept of k-NN is classification based on the closest neighbors [21]. The method has two stages: determining the nearest neighbors and then determining the class from those neighbors, where the dominant class among the neighbors determines the predicted class. Neighbors are usually selected with a distance metric. Figure 5 shows the selection of the 5 nearest neighbors; because the dominant class is the blue triangles, the prediction is assigned to that class.
SVM is a classification method that maps input vectors into a higher-dimensional space [22]. SVM forms a hyperplane (a boundary that separates the two classes) with a margin on each side of the shared data (see Figure 6). This classification model has performed well in previous studies, depending on how the parameter settings affect the generalization error (a measure of the algorithm’s accuracy in predicting unseen data).
The multilayer perceptron (MLP) is a development of artificial neural networks (ANNs) for classification that learns patterns in the data [18]. An MLP has neurons organized in layers that are fully connected from one layer to the next (see Figure 7), with at least one hidden layer between the input and the output. The number of input neurons matches the number of input patterns and the number of output neurons matches the number of classes, while the number of hidden layers can be chosen freely.
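For reference, the three classifiers are available in scikit-learn; the settings shown are placeholders within the ranges tuned later, not the final hyper-parameters.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# k-NN: majority vote among the k closest training samples
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski")

# SVM: maps inputs to a higher-dimensional space and fits a separating hyperplane
svm = SVC(C=1.0, kernel="rbf", gamma="auto", probability=True)

# MLP: fully connected network with one hidden layer between input and output
mlp = MLPClassifier(hidden_layer_sizes=(200,), solver="adam",
                    learning_rate="constant", learning_rate_init=0.0001,
                    max_iter=3000)
```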
In this study, several classification methods are used to examine the vocal datasets of patients with Parkinson’s disease, with the aim of producing the detection system with the best performance. Each classification method has different hyper-parameters, which the authors tune over the ranges described in Table 3. Tuning is evaluated with 10-fold cross-validation based on the area under the ROC curve (AUC). In the scikit-learn package, this exhaustive search over hyper-parameter combinations is called grid search cross-validation (GridSearchCV).
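A sketch of this tuning step, assuming scikit-learn’s GridSearchCV; the grid mirrors the SVM row of Table 3, while the AUC scoring and fold count follow the text. The variable names in the commented lines (balanced, feature-selected training data) are hypothetical.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {                     # ranges taken from Table 3 (SVM row)
    "C": [0.1, 0.5, 1.0],
    "kernel": ["rbf", "poly", "sigmoid"],
    "gamma": ["auto"],
}

search = GridSearchCV(
    estimator=SVC(probability=True),
    param_grid=param_grid,
    scoring="roc_auc",                                  # tuning is evaluated by AUC
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
# search.fit(X_bal[:, selected_idx], y_bal)             # balanced, feature-selected training data
# print(search.best_params_, search.best_score_)
```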
Testing is carried out by comparing combinations that use oversampling and/or feature selection against models that do not use these methods, evaluated with the classification performance parameters. Testing then proceeds with the predetermined classifiers, whose settings are evaluated by the AUC value; the classification model with the best AUC score after hyper-parameter tuning is used for further evaluation. Validation of the prediction model uses 25% testing data, 10-fold cross-validation, or leave-one-subject-out cross-validation, in accordance with the corresponding previous research. The best prediction model for each dataset is compared with previous studies that use the same dataset and validation technique.
The prediction models are evaluated with the confusion matrix, whose counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are used to calculate classification performance measures such as accuracy, sensitivity, precision, F1 score, and the AUC-ROC curve. These parameters are explained as follows:
Accuracy is the fraction of the entire population that is predicted correctly:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Sensitivity, also known as recall, is the fraction of actual positive samples that are predicted correctly:

$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$
Precision is the fraction of samples predicted as positive that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP}.$$
The F1 score is the harmonic mean of sensitivity and precision:

$$F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.$$
The area under the receiver operating characteristic curve (AUC-ROC) is a performance measure commonly used for binary classification problems. In the ROC curve of Figure 8, the true-positive rate is plotted against the false-positive rate over all classification thresholds; the closer the area under the curve is to one, the better the model separates the two classes.
The accuracy parameter and the number of selected features are used to compare the prediction model with related studies that use the same dataset. The AUC score is used to evaluate the tuning of the predictive model, while the ROC curves for both classes (a value of 1 for people with Parkinson’s disease and 0 for normal individuals) are used to evaluate the final prediction model.
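These measures can be computed directly from a model’s predictions; the helper below is a generic sketch with scikit-learn, not the authors’ evaluation script.

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)

def report(y_true, y_pred, y_score):
    """y_score is the predicted probability of the Parkinson's class (label 1)."""
    return {
        "accuracy":    accuracy_score(y_true, y_pred),    # (TP + TN) / all samples
        "sensitivity": recall_score(y_true, y_pred),      # TP / (TP + FN)
        "precision":   precision_score(y_true, y_pred),   # TP / (TP + FP)
        "f1":          f1_score(y_true, y_pred),          # harmonic mean of the two above
        "auc":         roc_auc_score(y_true, y_score),    # area under the ROC curve
    }
```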
The authors test the effect of the HSIC Lasso feature selection method and the SMOTE oversampling method on the prediction model by first evaluating the model without HSIC Lasso and SMOTE. This baseline is evaluated with accuracy, sensitivity, precision, F1, and AUC on all of the selected datasets, using the evaluation technique of the corresponding related research: 10-fold cross-validation (CV) for the Little et al. [7] dataset, 25% testing data for the Naranjo et al. [14] dataset, and leave-one-subject-out cross-validation (LOSOCV) for the Sakar et al. [8] dataset.
The prediction model on the Little et al. [7] dataset, shown in Table 4, produces good AUC values for the k-NN and MLP classifiers, while the AUC of the SVM classifier is about 10 percentage points lower. In terms of sensitivity and precision, however, the SVM model is not far behind the other classifiers.
On the Naranjo et al. [14] dataset, evaluated with 25% testing data, the three classification methods all produce accuracy below 80%, with MLP as the best classifier in this setting (see Table 5); likewise, the AUC values are far from one. This predictive model is still not suitable for a Parkinson’s disease detection system.
For the prediction model on the Sakar et al. [8] dataset, the sensitivity of each model is better than its precision (see Table 6). This dataset has more data on people with Parkinson’s disease than on normal individuals, a condition that leads to over-fitting toward the Parkinson’s disease class.
The authors test the unbalanced data in the Little et al. [7] and Sakar et al. [8] datasets to determine the effect of the imbalance on the prediction model. In both datasets, the normal class has fewer samples than the Parkinson’s disease class. The performance evaluated at this stage comprises accuracy, sensitivity, precision, F1, and AUC for all of the selected classification models.
The prediction model using the k-NN classifier in Table 7 shows better results than the model without HSIC Lasso feature selection in Table 4, and its AUC is very close to one; the other classifiers, however, show a slight decrease of about 2% in accuracy. The number of features selected from this dataset is 8 out of 22, so the HSIC Lasso k-NN model achieves slightly better performance with far fewer features. The selected features include spread1 and spread2 (two nonlinear measures of fundamental frequency variation), DFA (signal fractal scaling exponent), MDVP:Fhi (Hz) (maximum vocal fundamental frequency), MDVP:Fo (Hz) (average vocal fundamental frequency), HNR (harmonics-to-noise ratio), RPDE (nonlinear dynamical complexity measure), and PPE (nonlinear measure of fundamental frequency variation).
The HSIC Lasso prediction model on the Sakar et al. [8] dataset yields better sensitivity than precision (see Table 8). This dataset is unbalanced, with more people with Parkinson’s disease than normal individuals. The model uses 60 of the 754 features, and its AUC is still far from one. Compared with the model without HSIC Lasso, the HSIC Lasso model improves the SVM classifier, while the other classifiers show a slight decrease in accuracy.
This study applies oversampling, feature selection, and classifier tuning in successive stages. Oversampling the unbalanced datasets brings the ratio of people with Parkinson’s disease to normal individuals to 1:1. On the Little et al. [7] dataset, the minor class of normal individuals is raised from 48 to 147 samples, giving 294 samples in the training data; likewise, the Sakar et al. [8] dataset has its 192 normal samples raised to 564, bringing the total to 1,128 samples. Oversampling is applied only to the training data, so the amount of oversampling depends on how each evaluation technique splits the data. In the feature selection stage, the training data are then used by the HSIC Lasso method to select features that are relevant and not redundant. Feature selection outputs the indices of the selected features and the regression coefficient values, which act as the feature selection weights; unselected features have regression coefficients of zero.
The HSIC Lasso method, which selects features based on correlation values, ranks the regression coefficients to identify the relevant features. From the Little et al. [7] dataset of 22 features, 8 features were selected. The most important feature in this dataset is spread1 (a nonlinear measure of fundamental frequency variation), followed by spread2 (a second such measure), as shown in Figure 9; the other selected features are frequency-based features such as DFA, MDVP:Fhi, MDVP:Fo, HNR, RPDE, and PPE. Compared with the feature selection results on the unbalanced data, the selected features in the Little et al. [7] dataset are the same in type and number, showing that the class imbalance does not affect which features are selected in this dataset.
The selected features consist of four features from the time frequency subset and four from the baseline voice subset. The comparison between selected and total features is shown in Figure 10: the baseline subset has a selected-to-total ratio of 4:17, while the time frequency subset has a ratio of 4:5. In the Little et al. [7] dataset, the time frequency subset is therefore very influential in the prediction model. However, this does not imply that this subset contains the most important features of the vocal signal in general; it should be underlined that this holds only for the data collected by Little et al. [7].
Meanwhile, in the Naranjo et al. [14] dataset, which has 45 features, the same 18 features were selected in every evaluation with the different classifiers; the selected features are shown in Figure 11. The feature with the highest similarity value is HNR_35 (the harmonics-to-noise ratio in the 0–3,500 Hz frequency range), while the remaining features have similarity values well below it. HNR_35 belongs to the baseline voice feature subset.
Of the 18 selected features, 14 belong to the mel frequency subset and four to the baseline voice subset. Figure 12 shows that, of the 26 mel frequency features, 12 were discarded as irrelevant or redundant, and of the 19 baseline voice features, 15 were discarded. The Naranjo et al. [14] dataset therefore mostly discards the baseline voice subset, making the mel frequency subset the more important one for the prediction model.
In the Sakar et al. [8] dataset, which has 754 features, 50 features were selected by HSIC Lasso. In contrast to the other data, the Sakar et al. [8] dataset also has vocal fold, wavelet transform, and TQWT (tunable Q-factor wavelet transform) feature subsets. The four features with the highest similarity values are tqwt_TKEO_std_dec_12 (standard deviation of the 12th-level Teager-Kaiser energy operator), tqwt_kurtosisValue_dec_26 (kurtosis of the signal distribution in the 26th-level wavelet transform), tqwt_kurtosisValue_dec_20 (kurtosis of the signal distribution in the 20th-level wavelet transform), and tqwt_entropy_shannon_dec_12 (entropy of the probability distribution in the 12th-level wavelet transform).
As seen in Figure 13, most of the selected features belong to the TQWT subset, with 36 of its 432 features selected. The remaining selected features comprise six mel frequency features, one baseline voice feature, three vocal fold features, one time frequency feature, and three wavelet transform features. The wavelet transform subset has a selected-to-total feature ratio of 1:182, compared with 1:12 for the TQWT subset. This shows that the TQWT subset is the important feature subset of the Sakar et al. [8] dataset.
The features selected at the feature selection stage are used in the tuning stage for each classification method; the tuning results are shown in Table 9. Performance in tuning is evaluated with the AUC value. The search uses the GridSearchCV method with 10 folds, with the class ratio preserved in each fold.
The prediction model with the best hyper-parameters for each classification method is then evaluated on the testing data using the evaluation technique of the corresponding previous study, with accuracy, sensitivity, precision, F1, and AUC as parameters, together with the ROC curve to determine which classification method best suits the prediction model.
Table 10 shows that, on the Little et al. [7] dataset, the HSIC Lasso SMOTE model performs better than the model without oversampling and feature selection. The best accuracy is obtained by the MLP classifier, 8.08% higher than the model without HSIC Lasso SMOTE. The SVM classifier, however, gives lower accuracy even though its AUC is better. Balancing the data with SMOTE also increases accuracy by 5.50% for the MLP classifier, whereas for the other classifiers SMOTE has no significant effect.
Figure 14 shows good ROC curves for both the k-NN and MLP classifiers, especially MLP, whose AUC reaches 99.70%; the AUC of the SVM classifier is well below the other methods. One panel of the curve targets the normal class (index 0) and the other the Parkinson’s disease class (index 1); on these curves the SVM method tends to predict people with Parkinson’s disease better than normal individuals. The ROC curves indicate that MLP is the better choice for the prediction model.
No oversampling was carried out on the Naranjo et al. [14] dataset, because its classes are already balanced and SMOTE would have no effect. The HSIC Lasso prediction model on this dataset performs better than the model without feature selection (see Table 11). The best accuracy is obtained by the MLP classifier, 8.67% higher than the model without HSIC Lasso; the k-NN method shows poor results, with performance well below the other classifiers.
However, the ROC curves in Figure 15 show that the AUC values of the three classifiers are not far apart. The MLP method has the highest AUC, 88.33%, and its ability to predict the Parkinson’s disease class and the normal class is the same in both curves. The SVM method follows with an AUC of 86.89%, and the k-NN classifier has an AUC of 83.03% despite the lowest accuracy of the three. This again indicates that MLP is the best model for the Naranjo et al. [14] dataset.
Meanwhile, Table 12 shows that the HSIC Lasso SMOTE model improves accuracy by 15.78% for k-NN, 18.26% for SVM, and 15.06% for MLP over the model without HSIC Lasso SMOTE, with MLP again the best classifier. The sensitivity and precision values of each classifier are close to each other, showing that the three methods predict both classes in a balanced way. Balancing the data with SMOTE thus also improves the performance of the prediction model.
The ROC curves of this prediction model are also good for all three classifiers, with MLP giving the highest AUC of 99.22% (see Figure 16); MLP is therefore also the best model for the Sakar et al. [8] dataset. Across the three datasets, the MLP classifier yields good performance, with sensitivity and precision values that are close together and a fairly high F1 score. Compared with the other datasets, the Naranjo et al. [14] dataset has fairly low performance whether or not the HSIC Lasso method is used.
The previous experiments show that HSIC Lasso gives better performance than prediction models that do not use it. Moreover, using SMOTE to balance the Little et al. [7] and Sakar et al. [8] datasets strengthens the HSIC Lasso method in the feature selection process. The authors now compare the HSIC Lasso method with related studies that use the same datasets and feature selection.
Compared with other studies, the HSIC Lasso prediction model produces fewer features and better accuracy on the Little et al. [7] dataset. The feature subset used in this study is contained in the feature subsets used in the related studies [16, 23, 24] (see Table 13). Although the features are relatively the same, especially compared with Ouhmida et al. [16], the k-NN classifier in this study gives lower accuracy, even though its hyper-parameters were tuned. The MLP classifier, however, matches or exceeds the accuracy of the other studies. Besides using the same dataset, this comparison uses the same evaluation, namely 10-fold cross-validation.
On the Naranjo et al. [14] dataset, the HSIC Lasso method does not produce fewer features than other studies [17, 25] (see Table 14). The features used in those studies are of the same type and are contained in the feature subset used here. The method used in those studies, the Lasso, is also the root of the HSIC Lasso method. As explained earlier, the Lasso does not transform the data into kernel form, so it lacks the sparsity that strengthens the correlation between the features and the output. Nevertheless, for the Naranjo et al. [14] dataset, which has 45 features, 240 samples, and balanced classes, the HSIC Lasso model is less suitable; even so, the MLP classifier gives better performance than the other studies, which use fewer features. This suggests that the HSIC Lasso model is not the best choice for a detection system on the Naranjo et al. [14] dataset. No oversampling was used for this dataset, because the data are already balanced and oversampling would have no effect.
The HSIC Lasso selection on the Sakar et al. [8] dataset yields a number of features comparable to other studies relative to the total number of features in the dataset (see Table 15). The related studies on the Sakar et al. [8] dataset do not report in detail which features were selected, but Sakar et al. [8] report the distribution of features over the groups defined in Table 2. Table 16 compares this distribution of selected features with that of the MRMR SVM-RBF model of Sakar et al. [8].
Regardless of the feature distribution, the proposed predictive model yields better performance than the other studies, especially the HSIC Lasso SMOTE MLP model, which is 10.18% more accurate than the model of Sakar et al. [8] with the same number of features. On this dataset, the SMOTE method strongly affects the performance of the prediction model; without SMOTE, the performance is not much different from the other studies. In every model variation tested, HSIC Lasso gives the best accuracy when handling feature selection on very large amounts of data, and in every case it reduces the features effectively, providing better performance with a minimal number of features.
The proposed method successfully selects vocal voice features and performs well in building a Parkinson’s disease detection system. Compared with the prediction models of related research, the HSIC Lasso method can exceed their accuracy and produce fewer features under some conditions, depending on the data tested in the previous studies. The variation in the data includes the number of samples and features; the balance between the two classes also affects how HSIC Lasso determines the number of relevant, non-redundant features. HSIC Lasso proves most effective for building predictive models on balanced data with larger numbers of features and samples.
The HSIC Lasso feature selection method is better suited to high-dimensional data, because high-dimensional data transformed into Hilbert-Schmidt kernels produce stronger similarity values between the feature data and the output data. A limitation of this study is that it does not consider other aspects of the relation between the voice features and the subject’s condition: the authors generalize the condition of the subjects and ignore attributes such as age, gender, and the number of voice recordings per subject. These attributes are potentially important when evaluating human vocal data, because voices differ considerably across ages and genders.
No potential conflict of interest relevant to this article was reported.
Oversampling process using synthetic minority oversampling technique: (a) unbalanced dataset, (b) adding synthetic data from minority class, and (c) balanced dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Little et al. [7] dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Naranjo et al. [14] dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Sakar et al. [8] dataset.
Table 1. Details of data used.
Dataset | Number of features | Number of samples | Parkinson’s disease samples | Normal samples |
---|---|---|---|---|
Little et al. [7] | 22 | 195 | 147 | 48 |
Naranjo et al. [14] | 45 | 240 | 120 | 120 |
Sakar et al. [8] | 754 | 756 | 564 | 192 |
Table 2. Feature subset grouping.
Number | Feature subset | Description |
---|---|---|
I | Voice Baseline | Basic value of sound extraction |
II | Vocal Fold | Vocal cords extraction |
III | Time Frequency | Sound frequency time extraction |
IV | Mel Frequency | Short-term power spectrum coefficient |
V | Wavelet Transform | Feature extraction from wavelet transform |
VI | Tunable Q-factor Wavelet Transform | Feature extraction from wavelet transform with Q-factor |
Table 3. Hyper-parameter tuning.
Classification | Hyper-parameter |
---|---|
k-NN | |
SVM | C: [0.1, 0.5, 1.0], kernel: [rbf, poly, sigmoid], gamma: auto |
MLP | solver: [lbfgs, sgd, adam], hidden layer: [200, 300, 400], learning rate: [constant, invscaling, adaptive], iterations: 3000, learning rate init: 0.0001 |
Table 4. Performance test result without feature selection Little et al. [7] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 91.28 | 91.28 | 91.50 | 91.36 | 98.20 |
SVM | 89.23 | 89.23 | 90.57 | 88.08 | 88.27 |
MLP | 90.76 | 90.76 | 90.57 | 90.55 | 97.03 |
Table 5. Performance test result without feature selection Naranjo et al. [14] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 77.00 | 77.00 | 77.00 | 76.99 | 84.17 |
SVM | 78.67 | 78.67 | 78.89 | 78.75 | 86.98 |
MLP | 79.67 | 79.67 | 79.61 | 79.61 | 87.63 |
Table 6. Performance test result without feature selection Sakar et al. [8] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 75.39 | 75.39 | 72.15 | 72.26 | 65.83 |
SVM | 75.00 | 75.00 | 81.27 | 64.67 | 93.38 |
MLP | 81.12 | 81.12 | 81.83 | 77.85 | 84.24 |
Table 7. Performance test result with HSIC Lasso Little et al. [7] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 94.87 | 94.87 | 95.10 | 94.93 | 99.16 |
SVM | 93.33 | 93.32 | 93.90 | 93.47 | 96.71 |
MLP | 91.28 | 91.28 | 91.14 | 91.18 | 96.72 |
Table 8. Performance test result with HSIC Lasso Sakar et al. [8] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 87.43 | 87.43 | 88.21 | 86.17 | 89.72 |
SVM | 87.43 | 87.43 | 87.05 | 87.02 | 90.13 |
MLP | 87.96 | 87.96 | 87.64 | 87.69 | 91.55 |
Table 9. Hyper-parameter test result.
Dataset | Classification | Hyper-parameter | AUC (%) |
---|---|---|---|
Little et al. [7] | k-NN | | 98.13 |
Little et al. [7] | SVM | C: 1.0, Kernel: poly, Gamma: auto | 93.59 |
Little et al. [7] | MLP | Solver: adam, Hidden layer: 400, Learning rate: adaptive, Iterations: 3000 | 99.70 |
Naranjo et al. [14] | k-NN | | 86.03 |
Naranjo et al. [14] | SVM | C: 0.5, Kernel: poly, Gamma: auto | 86.89 |
Naranjo et al. [14] | MLP | Solver: adam, Hidden layer: 400, Learning rate: invscaling, Iterations: 3000 | 88.33 |
Sakar et al. [8] | k-NN | | 96.59 |
Sakar et al. [8] | SVM | C: 0.1, Kernel: poly, Gamma: auto | 97.72 |
Sakar et al. [8] | MLP | Solver: adam, Hidden layer: 400, Learning rate: invscaling, Iterations: 3000 | 99.22 |
Table 10. Performance test result with HSIC Lasso SMOTE Little et al. [7] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 97.61 | 97.61 | 97.67 | 97.61 | 97.61 |
SVM | 96.93 | 96.93 | 97.04 | 96.93 | 98.63 |
MLP | 98.84 | 98.57 | 95.68 | 96.57 | 99.70 |
Table 11. Performance test result with HSIC Lasso SMOTE Naranjo et al. [14] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 79.84 | 79.83 | 75.85 | 79.84 | 86.03 |
SVM | 83.33 | 83.33 | 83.37 | 83.32 | 86.89 |
MLP | 88.34 | 80.00 | 96.00 | 87.27 | 88.33 |
Table 12. Performance test result with HSIC Lasso SMOTE Sakar et al. [8] dataset (unit: %).
Classification method | Accuracy | Sensitivity | Precision | F1 | AUC |
---|---|---|---|---|---|
k-NN | 91.17 | 91.17 | 92.82 | 91.70 | 96.59 |
SVM | 93.26 | 93.26 | 93.40 | 93.25 | 97.72 |
MLP | 96.18 | 96.18 | 96.27 | 96.18 | 99.22 |
Table 13. Comparison with a number of previous studies on the Little et al. [7] dataset.
Studies | Method | Number of features | Accuracy (%) |
---|---|---|---|
Ouhmida et al. [16] | ReliefF + k-NN | 8 | 97.92 |
Thapa et al. [23] | Correlation-based + TSVM | 13 | 93.90 |
Ma et al. [24] | RFE + SVM | 11 | 96.29 |
This study | SMOTE + HSIC Lasso + k-NN | 8 | 97.61 |
This study | SMOTE + HSIC Lasso + SVM | 8 | 96.93 |
This study | SMOTE + HSIC Lasso + MLP | 8 | 98.84 |
Table 14. Comparison with a number of previous studies on the Naranjo et al. [14] dataset.
Studies | Method | Number of features | Accuracy (%) |
---|---|---|---|
Abdullah et al. [17] | Lasso + SVM | 10 | 86.90 |
Naranjo et al. [25] | Correlation-based + Lasso + Bayesian | 10 | 86.20 |
This study | HSIC Lasso + k-NN | 18 | 75.83 |
This study | HSIC Lasso + SVM | 18 | 83.33 |
This study | HSIC Lasso + MLP | 18 | 88.34 |
Table 15. Comparison with a number of previous studies on the Sakar et al. [8] dataset.
Studies | Method | Number of features | Accuracy (%) |
---|---|---|---|
Gunduz [11] | ReliefF + VAE | 60 | 91.60 |
Sakar et al. [8] | MRMR + SVM-RBF | 50 | 86.00 |
This study | SMOTE + HSIC Lasso + k-NN | 50 | 91.17 |
This study | SMOTE + HSIC Lasso + SVM | 50 | 93.26 |
This study | SMOTE + HSIC Lasso + MLP | 50 | 96.18 |
Table 16. Distribution of Sakar et al. [8] dataset features.
Studies | Feature selection method | Subset I | Subset II | Subset III | Subset IV | Subset V | Subset VI |
---|---|---|---|---|---|---|---|
Sakar et al. [8] | MRMR | 4 | 3 | 2 | 10 | 1 | 30 |
This study | HSIC Lasso | 1 | 3 | 1 | 6 | 3 | 36 |
International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(4): 482-499
Published online December 25, 2023 https://doi.org/10.5391/IJFIS.2023.23.4.482
Copyright © The Korean Institute of Intelligent Systems.
Wiharto Wiharto , Ahmad Sucipto, and Umi Salamah
Department of Informatics, Universitas Sebelas Maret, Surakarta, Indonesia
Correspondence to:Wiharto Wiharto (wiharto@staff.uns.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Parkinson’s disease is a neurological disorder which interferes human activities. Early detection is needed to facilitate treatment before the symptoms get worse. Earlier detection used vocal voice as a comparison with normal subject. However, detection using vocal voice still has weaknesses in detection system. Vocal voice contains a lot of information that isn’t necessarily relevant for a detection system. Previous studies proposed a feature selection method on detection system. However, the proposed method can’t handle variation in the amount of data. These variations include an imbalance sample to features and classes. In answering these problems, the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) feature selection method is used which has feature transformation capabilities that can produce more relevant features. In addition, detection system uses Synthetic Minority Oversampling Technique (SMOTE) method to balance data and several classification methods such as k-nearest neighbors, support vector machine, and multilayer perceptron to obtain best predictive model. HSIC Lasso produces 18 of 45 features with an accuracy of 88.34% on a small sample and 50 of 754 features with an accuracy of 96.16% on a large sample. From this result, when compared with previous studies, HSIC Lasso is more suitable on balanced data with more samples and features.
Keywords: Parkinson&rsquo,s disease, Early detection, Vocal voice, HSIC Lasso
Parkinson’s disease is a chronic and progressive neurodegenerative disorders characterized by motor and non-motor dysfunction of human nerves [1]. This neurodegenerative disorder occurs in dopamine-producing cells in the substantia nigra in the midbrain [2]. When dopamine-producing cells, also known as dopaminergic cells, degenerate, dopamine levels drop. Dopamine is a neurotransmitter whose job is to bind to G protein receptors in the dopaminergic signaling system which is very important for physiological processes and balance of activities of the human body [3]. The motor dysfunction of Parkinson’s disease is characterized by muscle weakness and stiffness, slurred speech, and weight loss. Meanwhile, non-motor dysfunction in Parkinson’s disease is followed by cognitive changes, sleep disturbances, and psychiatric symptoms [4]. If these symptoms get worse, it will affect all activities of that person and is difficult to treat. Researchers continue to strive to prevent the development of Parkinson’s disease by conducting early detection, in order to make it easier for clinical officers to treat people with Parkinson’s disease.
The previous detection of Parkinson’s disease used a lot of people’s vocal voices as a comparison with normal individuals. In the study of Sakar et al. [5] said that 90% of people with Parkinson’s disease have problems with their voices. Sakar et al. [5] also proved by testing the pronunciation of letters, numbers, and some words for people with Parkinson’s disease and normal individuals. The test results show that the pronunciation of vowels provides more discriminatory information in cases of Parkinson’s disease. That’s because, voice noise as a comparison characteristic of disease is more clearly heard in the pronunciation of vowel sounds [6]. However, the use of human vocal sounds still has weaknesses in the detection system.
Human vocal sound has a lot of information that is not necessarily useful for the detection process. Little et al. [7] tried to extract sound signals based on frequency, duration, and noise. Meanwhile, Sakar et al. [8] transforming voice signals with wavelet transform and extracting voice signals based on frequency and noise. Both use a machine learning approach, because it can shorten the time in analyzing and observing the development of a disease [9]. Moreover, machine learning can automatically make it easier to process large amounts of data, including attributes of vocal sounds [10]. The sound extraction in this study was used as a data attribute for the learning process of a detection system with a machine learning approach. From the detection system in both studies, the results show that not all attributes make a significant contribution to the learning process. From these problems, we need a method to choose the attributes of vocal sound extraction that are useful for the detection system.
The process of selecting attributes or selecting features will be very useful to improve the performance of the detection system. In the same case, Sakar et al. [8] tested the minimum redundancy maximum relevance (MRMR) feature selection method which produces a Parzen window estimator to calculate feature correlations in the data. However, the Parzen window estimator is not reliable when the number of training samples is small and results in less relevant features. Gunduz [11] uses the ReliefF method in filtering features based on feature scores. However, ReliefF shows poor performance on data that has unbalanced classes and produces features that are less relevant for the detection system. Feature selection method implementation that has been applied previous studies cannot overcome variations in the amount of data including the number of samples against the number of features and the number of samples that are not balanced between Parkinson’s disease and normal subject. Therefore, a feature selection method is needed that can produce features that are more relevant to variations in the amount of data.
Yamada et al. [12] presents a feature selection method Hilbert Schmidt Independence Criterion Lasso (HSIC Lasso) which transforms features into kernel form to obtain more relevant features. The resulting kernel shape has sparsity which improves the calculation of feature correlation values. Damodaran et al. [13] used HSIC Lasso and succeeded in selecting high-dimensional image features and producing good prediction performance. Therefore, this study will test the HSIC Lasso feature selection method in the Parkinson’s disease detection system which is expected to produce vocal sound features that are more relevant to variations in the amount of data.
This study is related to other research based on the methods and materials used in the research process. Some of these linkages include using the selection, oversampling, and classification features in the same way or applied to the same case. The dataset material used is also taken from several studies by taking, cases, and forms in the form of feature extraction which are relatively the same. The datasets are from Little et al. [7], Naranjo et al. [14], and Sakar et al. [8]. The research of Little et al. [7] aims to introduce a new technique for analyzing voices in Parkinson’s disease. The methods used in this study are recurrence and fractal scaling. The results show that recurrence and fractal scaling are effective in regression classification with thousands of features. Naranjo et al. [14] tested the replication of voice recordings that differentiated healthy people from people with Parkinson’s disease using a subject-based Bayesian approach. The proposed system is able to distinguish cases even with a small sample. While Sakar et al. [8] examined feature extraction of tunable Q-factor wavelet transform in voice signals in cases of Parkinson’s disease. The proposed method shows a better performance than the previously applied technique.
Then application of the feature selection method used in this study was taken from the study of Yamada et al. [12]. They aims to prove the optimal solution in determining features relevant to the kernel in the feature selection method using the HSIC Lasso. The proposed method is proven to be effective in the classification and regression process with thousands of features. Meanwhile, oversampling and classification used in this study are related to research that uses selection features in the same case and dataset. In the oversampling method, the synthetic minority oversampling technique (SMOTE), also applied by Hoq et al. [15]. In [15], the authors tries to apply artificial intelligent in the case of Parkinson’s disease by extracting vocal sound features using principal component analysis and sparse autoencoder with synthetic minority oversampling. The sparse autoencoder method results in better performance in the detection system. Then from the classification method, Ouhmida et al. [16] using the k-nearest neighbors (k-NN) classification method to analyze the performance of machine learning in identifying Parkinson’s disease based on speech disorders. In addition, the MRMR and ReliefF feature selection is used. The results show that the ReliefF feature selection results in better performance. Abdullah et al. [17] applying machine learning in a Parkinson’s disease diagnosis system based on voice recordings with Lasso feature selection and support vector machine (SVM) classification. The results show that the model produces better performance than the previous method. Meanwhile, Ramchoun et al. [18] introduces a new approach to optimize network architecture with multilayer perceptron classification method. The model architecture shows a high effectiveness compared to the previously applied model.
Feature selection research using HSIC Lasso for the detection of Parkinson’s disease is carried out by creating a detection system with a machine learning approach. In making the detection system, several stages were carried out such as feature selection, data balancing with oversampling, and classification to build predictive models. The stages are shown in Figure 1. In obtaining the optimal prediction model, the hyper-parameter adjustment stage is carried out on the classification model. The data is divided into two parts, namely training data and testing data. Prediction model setup using training data. While the rest, data testing, is used to evaluate the prediction model. The distribution of the data is based on the evaluation technique used in related research. It is intended that the results of the prediction model in this study can be compared with related research. The evaluation technique used is to divide 75% of the training data and 25% of the testing data on Naranjo et al. [14] dataset, 10-fold cross-validation on Little et al. [7] dataset, and leave one subject out cross-validation on Sakar et al. [8] dataset.
The dataset that will be used in this study was taken from the UCI (University of California at Irvine) machine learning database. Data was collected by selecting the same case (Parkinson’s disease), relatively similar features (the extraction of vocal cords of people with Parkinson’s disease), and the same sampling method. In this study, the authors propose several data sets that meet these prerequisites, which are shown in Table 1.
The data acquisition is classified as a physical test by taking the sound of the vowel “a” as many as three samples per subject, except for the Little et al. [7] dataset who took six samples per subject. The use of the vowel “a” already represents all vowel pronunciations [7]. Voice capture uses a microphone whose frequency is set at 44.1 kHz. The results of the sound extraction are grouped into the feature subsets described in Table 2.
Hoq et al. [15] used a method called SMOTE to address the problem of unbalanced data in machine learning. SMOTE oversamples the minority class: each new synthetic sample is generated between a minority-class sample and one of its nearest minority-class neighbors, with the neighbor chosen at random. In Figure 2, the synthetic data (red triangles) generated from the minority class (blue triangles) make the two classes balanced.
The datasets of Sakar et al. [8] and Little et al. [7] have a roughly 3:1 ratio of samples from people with Parkinson’s disease to samples from normal individuals, so both datasets are unbalanced. This can cause the predictive model to over-fit one class (performing very well on that class during training) while ignoring the minority class. Therefore, the authors apply oversampling with SMOTE to add synthetic data to the unbalanced datasets: new samples are generated between neighboring minority-class data until the class ratio becomes 1:1. Oversampling is carried out only on the training data, because the synthetic samples are not relevant for evaluating the model [15]; a minimal sketch of this step is shown below.
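The sketch below uses the imbalanced-learn package and continues the placeholder variables from the evaluation sketch; the random seed is an illustrative choice.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample only the training split so that synthetic samples never
# leak into the evaluation data.
smote = SMOTE(k_neighbors=5, random_state=42)  # interpolate between 5 nearest minority neighbours
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))       # roughly 3:1 in the unbalanced datasets
print("after: ", Counter(y_train_bal))   # 1:1 after oversampling
```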
The least absolute shrinkage and selection operator (Lasso) performs feature selection and coefficient shrinkage simultaneously [19]. Lasso drives the regression coefficients of irrelevant and/or redundant features (features that contribute little to reducing the prediction error) to exactly zero or close to zero. The Lasso optimization problem is

$$\min_{\boldsymbol{\beta}} \ \frac{1}{2}\left\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \right\|_2^2 + \lambda \left\| \boldsymbol{\beta} \right\|_1,$$

where $\mathbf{X}$ is the input feature matrix, $\mathbf{y}$ is the output vector, $\boldsymbol{\beta}$ is the vector of regression coefficients, $\|\cdot\|_1$ is the $\ell_1$-norm, and $\lambda \geq 0$ is the regularization parameter that controls the amount of shrinkage.
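As a small illustration of this shrinkage (a toy example, not the data of this study), an ordinary Lasso fit on standardized features sets the coefficients of irrelevant features exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy regression problem in which only features 0 and 3 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_std, y)
print(np.round(lasso.coef_, 3))  # coefficients of the irrelevant features are exactly 0
```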
Yamada et al. [12] proposed the HSIC Lasso method, an alternative formulation of the nonlinear feature-wise Lasso that obtains sparsity over features by working in a reproducing kernel Hilbert space. The HSIC Lasso optimization problem is

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^{d}} \ \frac{1}{2}\left\| \bar{\mathbf{L}} - \sum_{k=1}^{d} \alpha_k \bar{\mathbf{K}}^{(k)} \right\|_F^2 + \lambda \left\| \boldsymbol{\alpha} \right\|_1, \quad \text{s.t.} \ \alpha_1, \ldots, \alpha_d \geq 0,$$

where $\|\cdot\|_F$ is the Frobenius norm, $\bar{\mathbf{K}}^{(k)} = \boldsymbol{\Gamma}\mathbf{K}^{(k)}\boldsymbol{\Gamma}$ is the centered Gram matrix computed from the $k$-th input feature, $\bar{\mathbf{L}} = \boldsymbol{\Gamma}\mathbf{L}\boldsymbol{\Gamma}$ is the centered Gram matrix of the output, $\boldsymbol{\Gamma} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^{\top}$ is the centering matrix, $\alpha_k$ is the regression coefficient of the $k$-th feature, and $\lambda \geq 0$ is the regularization parameter.
In Figure 4, the input and output data are first normalized with a z-score (centered by the mean and scaled by the standard deviation) to facilitate the learning process. HSIC Lasso then uses the Gaussian kernel for the input,

$$K(x, x') = \exp\!\left( -\frac{(x - x')^2}{2\sigma_x^2} \right),$$

where $x$ and $x'$ are normalized values of an input feature and $\sigma_x$ is the kernel width. For the classification output, a delta kernel, which is nonzero only when two samples share the same class label, is used in HSIC Lasso [12].
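The sketch below illustrates this formulation directly: it builds the centered, Frobenius-normalized Gram matrices per feature (Gaussian kernel for the inputs, a simple 0/1 delta kernel for the labels) and solves the resulting non-negative Lasso with scikit-learn’s coordinate-descent solver instead of the LARS-based solver described next; the kernel width and regularization strength are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_gram(x, sigma):
    # Gram matrix of a single feature under the Gaussian kernel.
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def center(K):
    # Gamma * K * Gamma with the centering matrix Gamma = I - (1/n) 11^T.
    n = K.shape[0]
    G = np.eye(n) - np.ones((n, n)) / n
    return G @ K @ G

def hsic_lasso(X, y, lam=0.01, sigma=1.0):
    n, d = X.shape
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)             # z-score normalization
    L = center((y[:, None] == y[None, :]).astype(float))  # delta kernel for the labels
    L /= np.linalg.norm(L)                                # Frobenius normalization
    K_vec = np.empty((n * n, d))
    for k in range(d):
        Kk = center(gaussian_gram(Xz[:, k], sigma))
        K_vec[:, k] = (Kk / np.linalg.norm(Kk)).ravel()
    # Non-negative Lasso over the vectorized Gram matrices.
    model = Lasso(alpha=lam, positive=True, fit_intercept=False)
    model.fit(K_vec, L.ravel())
    return np.flatnonzero(model.coef_)                    # indices of selected features

# selected = hsic_lasso(X_train_bal, y_train_bal)
```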
Least-angle regression (LARS) is a stepwise procedure that simplifies the computation of all Lasso-type problems [20]. LARS estimates the Lasso regression coefficients in stages until the selected features are strongly correlated with the output, so in practice the regularization parameter of the HSIC Lasso does not need to be tuned manually.
In each iteration of the LARS algorithm, the active feature index set $\mathcal{A} \subseteq \{1, 2, \ldots, d\}$ is updated: the feature most strongly correlated with the current residual is added to $\mathcal{A}$, and the regression coefficients of the active features are then moved jointly in the equiangular (least-angle) direction until another feature reaches the same correlation. After the iterations end, the features with nonzero regression coefficients are selected, so the procedure can simply be stopped once the desired number of features has entered the active set.
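For a concrete (generic, non-HSIC) look at this behavior, scikit-learn’s lars_path traces the entire Lasso coefficient path and reports the order in which features enter the active set; the toy data below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Toy problem: only features 1 and 4 influence the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=80)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which features become active:", active)
print("coefficients at the end of the path:", np.round(coefs[:, -1], 2))
```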
In this study, feature selection is carried out with the HSIC Lasso method. Because the LARS-based solver traces the regularization path automatically, the main quantity that must be specified is the number of features to retain.
The processed data are used to train the prediction models for people with Parkinson’s disease. In machine learning, this learning stage corresponds to the classification process. This study uses several classification methods, which are described below.
The basic concept of k-NN is classification based on the nearest neighbors of a sample [21]. The method has two stages: first, the nearest neighbors are determined, usually with a distance metric; second, the class is decided from those neighbors, with the dominant class among them assigned to the sample. Figure 5 shows the process of determining the 5 nearest neighbors; because the dominant class among them is the blue triangles, the sample is predicted as that class.
SVM is a classification method that maps input vectors into a higher-dimensional space [22]. SVM constructs a hyperplane (a boundary that separates the two classes) between the data of the two classes (see Figure 6). This classifier has performed well in previous studies, although its generalization error (a measure of the algorithm’s accuracy on unseen data) depends strongly on the parameter settings.
The multilayer perceptron (MLP) is a development of artificial neural networks (ANN) for classification that learns patterns in the data [18]. An MLP has neurons organized in layers, with every neuron connected to the neurons of the adjacent layers (see Figure 7). There is at least one hidden layer between the input and output layers. The number of input neurons matches the dimension of the input pattern, the number of output neurons matches the number of classes, and the number of hidden layers can be chosen freely. The three classifiers are instantiated in the sketch below.
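The sketch below instantiates the three classifiers with scikit-learn; the constructor arguments shown are defaults or placeholders, since the actual values are chosen by the tuning described next.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Candidate classifiers; hyper-parameters here are placeholders that the
# grid search below will overwrite.
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True),  # probability=True enables ROC/AUC evaluation
    "MLP": MLPClassifier(max_iter=3000, learning_rate_init=0.0001),
}
```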
In this study, several classification methods are used to examine the vocal datasets of patients with Parkinson’s disease, with the aim of producing the detection system with the best performance. Each classification method has different hyper-parameters, which the authors tune over the ranges described in Table 3. The tuning is evaluated with 10-fold cross-validation using the area under the ROC curve (AUC) as the score. In the scikit-learn package, this exhaustive search over all hyper-parameter combinations is called grid search cross-validation (GridSearchCV); a sketch is given below.
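In the sketch below, the SVM and MLP grids follow Table 3; the k-NN grid is a placeholder because its candidate values are not listed, and the data variables continue the earlier sketches.

```python
from sklearn.model_selection import GridSearchCV

param_grids = {
    "k-NN": {"n_neighbors": [3, 5, 7, 9]},  # assumed values; not reported in Table 3
    "SVM": {"C": [0.1, 0.5, 1.0],
            "kernel": ["rbf", "poly", "sigmoid"],
            "gamma": ["auto"]},
    "MLP": {"solver": ["lbfgs", "sgd", "adam"],
            "hidden_layer_sizes": [(200,), (300,), (400,)],
            "learning_rate": ["constant", "invscaling", "adaptive"]},
}

best = {}
for name, model in models.items():
    # 10-fold cross-validated grid search scored by AUC.
    search = GridSearchCV(model, param_grids[name], scoring="roc_auc", cv=10, n_jobs=-1)
    search.fit(X_train_bal[:, selected], y_train_bal)
    best[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 4))
```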
The tests compare models that use combinations of the oversampling and/or feature selection methods against models that use neither, evaluated on the classification performance parameters. Testing then proceeds with the predetermined classifiers, whose tuned settings are evaluated on the AUC score; the classification model with the best AUC is used for hyper-parameter adjustment and further evaluation. Validation of the prediction models uses the 25% testing split, 10-fold cross-validation, and leave-one-subject-out cross-validation, in accordance with the previous studies. The best prediction model for each dataset is compared with previous studies that used the same dataset and validation technique.
The results of the prediction model are evaluated with the confusion matrix, from which classification performance measures such as accuracy, sensitivity, precision, F1 score, and the AUC-ROC curve are calculated. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.
Accuracy is the fraction of the entire population that is predicted correctly:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Sensitivity, also known as recall, is the fraction of the actual positive samples that are predicted correctly:

$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$

Precision is the fraction of the samples predicted as positive that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP}.$$

The F1 score is the harmonic mean of sensitivity and precision:

$$F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.$$
The area under the receiver operating characteristic curve (AUC-ROC) is a probability-based performance measure commonly used for binary classification problems. In the ROC curve of Figure 8, the true-positive rate is plotted against the false-positive rate at different decision thresholds; the closer the area under this curve is to 1, the better the classifier separates the two classes.
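These quantities can be computed directly from the confusion matrix or with scikit-learn’s metric functions, as in the sketch below, which again continues the placeholder variables of the earlier sketches.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, f1_score, roc_auc_score)

y_pred = best["MLP"].predict(X_test[:, selected])
y_score = best["MLP"].predict_proba(X_test[:, selected])[:, 1]

# Metrics from the confusion matrix, cross-checked against scikit-learn.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   ", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_test, y_pred))
print("sensitivity", tp / (tp + fn), recall_score(y_test, y_pred))
print("precision  ", tp / (tp + fp), precision_score(y_test, y_pred))
print("F1 score   ", f1_score(y_test, y_pred))
print("AUC        ", roc_auc_score(y_test, y_score))
```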
The accuracy and the number of selected features are used to compare the prediction model with related studies on the same dataset. The AUC score is used to evaluate the tuning of the predictive model, while the ROC curves for both classes (label 1 for people with Parkinson’s disease and label 0 for normal individuals) are used to evaluate the performance of the final prediction model.
The authors test the effect of the HSIC Lasso feature selection method and the SMOTE oversampling method on the prediction model by first evaluating the model without HSIC Lasso and SMOTE. This baseline is evaluated with the accuracy, sensitivity, precision, F1, and AUC parameters on all of the selected datasets, using the evaluation technique of the corresponding related study: 10-fold cross-validation (CV) for the Little et al. [7] dataset, 25% testing data for the Naranjo et al. [14] dataset, and leave-one-subject-out cross-validation (LOSOCV) for the Sakar et al. [8] dataset.
The prediction model on the Little et al. [7] dataset, shown in Table 4, produces good AUC values for the k-NN and MLP classifiers, whereas the AUC of the SVM classifier lags behind by roughly 10%. In terms of sensitivity and precision, however, the SVM model is not far from the other classifiers.
For the Naranjo et al. [14] dataset, evaluated on the 25% testing data, all three classification methods produce accuracy below 80%, with MLP as the best classifier in this setting (see Table 5). The AUC values are likewise far from 1. This predictive model is therefore still not suitable for a Parkinson’s disease detection system.
For the prediction model on the Sakar et al. [8] dataset, the sensitivity of each model is better than its precision (see Table 6). This dataset contains more samples from people with Parkinson’s disease than from normal individuals, so the models over-fit the Parkinson’s disease class.
The authors also test the unbalanced data in the Little et al. [7] and Sakar et al. [8] datasets to determine the effect of class imbalance on the prediction model. In both datasets, the normal-individual class has fewer samples than the Parkinson’s disease class. The performance evaluated at this stage consists of the accuracy, sensitivity, precision, F1, and AUC scores for all of the selected classification models.
The prediction model using the k-NN classifier in Table 7 shows better results than the corresponding model without HSIC Lasso feature selection in Table 4, and its AUC is very close to 1. The other classification methods, however, show a slight decrease of about 2% in accuracy. The number of features selected for this dataset is 8 out of 22, so the HSIC Lasso k-NN model achieves slightly better performance with far fewer features. The selected features include spread1 (variation of the first frequency spread), spread2 (variation of the second frequency spread), DFA (signal fractal scaling exponent), MDVP:Fhi (Hz) (maximum vocal fundamental frequency), MDVP:Fo (Hz) (average vocal fundamental frequency), HNR (harmonics-to-noise ratio), RPDE (nonlinear dynamical complexity measure), and PPE (nonlinear measure of fundamental frequency variation).
The HSIC Lasso prediction model on the Sakar et al. [8] dataset yields better sensitivity than precision (see Table 8). This dataset is unbalanced: there are more samples from people with Parkinson’s disease than from normal individuals. The HSIC Lasso model uses 60 of the 754 features. The AUC of this model is also far from 1. Compared with the model without HSIC Lasso, the HSIC Lasso model improves the SVM-based model, whereas the other classification methods show a slight decrease in accuracy.
This study applies the oversampling, feature selection, and classifier tuning stages sequentially. Oversampling the unbalanced datasets brings the ratio of Parkinson’s disease to normal samples to 1:1. On the Little et al. [7] dataset, oversampling raises the minority class of normal individuals from 48 to 147 samples, so the training data contain 294 samples in total; likewise, the Sakar et al. [8] dataset goes from 192 to 564 normal samples, for a total of 1,128 samples. Oversampling is applied only to the training data, so the oversampling result depends on the data split of each evaluation technique. In the feature selection stage, the training data are used by the HSIC Lasso method to select features that are relevant and not redundant. Feature selection outputs the indices of the selected features and the regression coefficient values that act as feature weights; unselected features have a regression coefficient of zero.
The HSIC Lasso method, which selects features based on correlation values, ranks the regression coefficients to identify the relevant features. From the Little et al. [7] dataset, which consists of 22 features, 8 features were selected by HSIC Lasso. The most important feature in this dataset is spread1 (variation of the first frequency spread), followed by spread2 (variation of the second frequency spread), as shown in Figure 9; the other selected features are basic frequency features such as DFA, MDVP:Fhi, MDVP:Fo, HNR, RPDE, and PPE. Compared with the feature selection results on the unbalanced data, the selected features in the Little et al. [7] dataset are of the same type and number, which shows that the class imbalance does not affect the selected features of this dataset.
The selected features consist of four features from the time frequency subset and four from the baseline voice subset. The comparison between the selected features and the total features is shown in Figure 10. The baseline subset has a selected-to-total ratio of 4:17, whereas the time frequency subset has a ratio of 4:5, so in the Little et al. [7] dataset the time frequency subset is proportionally the most influential in the prediction model. However, this does not mean that this subset is the most important feature group of the vocal signal in general; this finding holds only for the data collected by Little et al. [7].
Meanwhile, in the Naranjo et al. [14] dataset, which has 45 features, the same 18 features were selected in each evaluation with the different classifiers. The selected features are shown in Figure 11. The feature with the highest similarity value is HNR_35 (the harmonics-to-noise ratio in the 0–3,500 Hz frequency range), which belongs to the baseline voice feature subset; the remaining features have similarity values far below that of HNR_35.
Of the 18 selected features, 14 come from the mel frequency subset and four from the baseline voice subset. Figure 12 shows that 12 of the 26 mel frequency features were discarded as irrelevant or redundant, as were 15 of the 19 baseline voice features. In the Naranjo et al. [14] dataset, most of the baseline voice subset is therefore discarded, making the mel frequency subset the more important one for the prediction model.
From the Sakar et al. [8] dataset, which has 754 features, 50 features were selected with HSIC Lasso. In contrast to the other data, the Sakar et al. [8] dataset also contains vocal fold, wavelet transform, and TQWT (tunable Q-factor wavelet transform) feature subsets. The four features with the highest similarity values are tqwt_TKEO_std_dec_12 (standard deviation of the Teager–Kaiser energy operator at the 12th decomposition level), tqwt_kurtosisValue_dec_26 (kurtosis describing the signal distribution at the 26th level), tqwt_kurtosisValue_dec_20 (kurtosis describing the signal distribution at the 20th level of the wavelet transform), and tqwt_entropy_shanon_dec_12 (entropy of the probability distribution at the 12th level of the wavelet transform).
As shown in Figure 13, most of the selected features belong to the TQWT subset: 36 of its 432 features were selected. The remaining selected features consist of six mel frequency features, one baseline voice feature, three vocal fold features, one time frequency feature, and three wavelet transform features. The wavelet transform subset has the smallest ratio of selected to total features (1:182), compared with 1:12 for the TQWT subset. This indicates that the TQWT subset is the important feature group of the Sakar et al. [8] dataset.
The features selected in the feature selection stage are used in the tuning stage of each classification method. The tuning results are shown in Table 9. Performance during tuning is evaluated with the AUC value, using GridSearchCV with 10 folds in which the class ratio of each fold is kept the same (stratified folds).
The prediction model with the best hyper-parameters for each classification method is then evaluated on the testing data with the evaluation technique of the corresponding previous study, using the accuracy, sensitivity, precision, F1, and AUC parameters, together with the ROC curve to determine which classification method is best for the prediction model.
Table 10 shows that the HSIC Lasso SMOTE model performs better than the model without oversampling and feature selection. The best accuracy was obtained by the MLP classifier, 8.08% higher than the model without HSIC Lasso SMOTE. The SVM-based model, however, yields lower accuracy even though its AUC is better. Balancing the data with SMOTE also increases the accuracy of the MLP classifier by 5.50%, whereas for the other classifiers SMOTE has no significant effect.
Figure 14 shows good ROC curves for both the k-NN and MLP classifiers, especially MLP, whose AUC reaches 99.70%, whereas the AUC of the SVM classifier is noticeably lower. One panel of the figure targets the class of normal individuals (label 0); on this curve the SVM method predicts people with Parkinson’s disease better than normal individuals. The other panel targets the class of people with Parkinson’s disease (label 1). These ROC curves indicate that MLP is the better choice for the prediction model.
No oversampling was carried out on the Naranjo et al. [14] dataset because its classes are already balanced, so SMOTE would have no effect. The HSIC Lasso prediction model on this dataset performs better than the model without feature selection (see Table 11). The best accuracy was obtained by the MLP classifier, 8.67% higher than the model without HSIC Lasso. The k-NN method shows the weakest results, with a performance gap that is quite far from the other classifiers.
However, the ROC curves in Figure 15 show that the AUC values of the three classifiers are not far apart. The MLP method has the highest AUC at 88.33%, and the two curves show that its ability to predict the Parkinson’s disease class and the normal class is comparable. The SVM method follows with an AUC of 86.89%, also with a comparable ability to predict both classes, and the k-NN classifier reaches an AUC of 83.03% despite having the lowest accuracy. This again indicates that MLP is a good choice for the prediction model on the Naranjo et al. [14] dataset.
Meanwhile, Table 12 shows that the HSIC Lasso SMOTE model improves accuracy by 15.78% for k-NN, 18.26% for SVM, and 15.06% for MLP compared with the model without HSIC Lasso SMOTE, with MLP as the best classification method. The sensitivity and precision of each classifier are close to each other, showing that all three methods produce balanced predictions. Balancing the data with SMOTE also contributes to the improved performance of the prediction model.
The ROC curves of this prediction model are also good for all three classifiers; MLP again produces the highest AUC, at 99.22% (see Figure 16), confirming that MLP is well suited to the Sakar et al. [8] dataset. Across the three datasets, the MLP-based prediction model performs well: its sensitivity and precision are close to each other and its F1 score is high. Compared with the other datasets, the Naranjo et al. [14] dataset yields fairly low performance, with or without the HSIC Lasso method.
The previous experiments show that HSIC Lasso gives better performance than the prediction model without it, and that balancing the Little et al. [7] and Sakar et al. [8] datasets with SMOTE strengthens the HSIC Lasso feature selection. The authors next compare the HSIC Lasso method with related studies that use the same datasets and a feature selection method.
Compared with other studies, the HSIC Lasso prediction model produces fewer features and better accuracy on the Little et al. [7] dataset. The feature subset used in this study is contained in the feature subsets used in the related studies [16, 23, 24] (see Table 13). Although the features are relatively the same, particularly those of Ouhmida et al. [16], the k-NN classifier in this study yields lower accuracy even though it was tuned with the best hyper-parameters. The MLP classifier, however, matches or exceeds the accuracy of the other studies. Besides using the same dataset, this comparison evaluates the predictive models with the same 10-fold cross-validation.
On the Naranjo et al. [14] dataset, the HSIC Lasso method does not produce fewer features than the other studies [17, 25] (see Table 14). The features used in those studies are of the same type and are contained in the feature subset used here. The method used in those studies is also the root of the HSIC Lasso method, namely the plain Lasso. As explained earlier, the plain Lasso does not transform the data into kernel form, so it does not obtain the kernel-induced sparsity that strengthens the measured dependence between the features and the output. In the case of the Naranjo et al. [14] dataset, which has 45 features, 240 samples, and balanced classes, the HSIC Lasso model is therefore less suitable; even so, the MLP method still produces better performance than the other studies, which use fewer features. No oversampling was applied to this dataset, because the data are already balanced and SMOTE would have no effect.
On the Sakar et al. [8] dataset, the HSIC Lasso selection yields a number of features that is not markedly different from other studies relative to the total number of features (see Table 15). The related studies on this dataset do not report in detail which features were selected; however, Sakar et al. [8] report the distribution of the features grouped as in Table 2, and Table 16 compares the distribution of the features selected here with that of the MRMR SVM-RBF model of Sakar et al. [8].
Regardless of the feature distribution, the proposed predictive model yields better performance than the other studies, in particular the HSIC Lasso SMOTE MLP model, which is 10.18% more accurate than the model of Sakar et al. [8] with the same number of features. On this dataset, SMOTE strongly affects the performance of the prediction model: without it, the performance does not differ much from the other studies. In every model variation tested, HSIC Lasso gives the best results for feature selection on very large amounts of data in terms of prediction accuracy, and in every case it reduces the features effectively, providing better performance with a minimal number of features.
The proposed method successfully selects vocal voice features and performs well in building a Parkinson’s disease detection system. Compared with the prediction models of related research, the HSIC Lasso method can exceed their accuracy and produce fewer features under some conditions, depending on the data used in the previous studies. The variation of the data includes the numbers of samples and features, and the ratio of samples between the two classes also affects the HSIC Lasso calculation in determining the relevant, non-redundant features. HSIC Lasso proved more effective for building predictive models on balanced data with larger numbers of features and samples.
The HSIC Lasso feature selection method is better suited to high-dimensional data: the data transformed into the Hilbert-Schmidt kernel then yield stronger similarity values between the feature data and the output data. A limitation of this study is that it does not consider other aspects of the relationship between the voice features and the subject’s condition. The authors generalize the condition of the subjects and ignore attributes such as age, gender, and the number of voice recordings per subject, which are potentially important when evaluating human vocal data because voices differ considerably across age and gender.
Parkinson’s disease detection system flow.
Oversampling process using synthetic minority oversampling technique: (a) unbalanced dataset, (b) adding synthetic data from minority class, and (c) balanced dataset.
Lasso regression graph.
HSIC Lasso method flow.
K-nearest neighbors method overview.
Support vector machine method overview.
Multilayer perceptron method overview.
ROC curve graph.
Selected feature score for the Little et al. [7] dataset.
Distribution of selected features for the Little et al. [7] dataset.
Selected feature score for the Naranjo et al. [14] dataset.
Distribution of selected features for the Naranjo et al. [14] dataset.
Distribution of selected features for the Sakar et al. [8] dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Little et al. [7] dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Naranjo et al. [14] dataset.
ROC graph result: (a) targeting Parkinson’s disease class and (b) targeting normal class, Sakar et al. [8] dataset.
Table 2 . Feature subset grouping.
Number | Feature subset | Description |
---|---|---|
I | Voice Baseline | Basic value of sound extraction |
II | Vocal Fold | Vocal cords extraction |
III | Time Frequency | Sound frequency time extraction |
IV | Mel Frequency | Short-term power spectrum coefficient |
V | Wavelet Transform | Feature extraction from wavelet transform |
VI | Tunable Q-factor Wavelet Transform | Feature extraction from wavelet transform with Q-factor |
Table 3 . Hyper-parameter tuning.
Classification | Hyper-parameter |
---|---|
k-NN | |
SVM | C: [0.1, 0.5, 1.0], kernel: [rbf, poly, sigmoid], gamma: auto |
MLP | solver: [lbfgs, sgd, adam], hidden layer: [200, 300, 400], learning rate: [constant, invscaling, adaptive], iterations: 3000, learning rate init: 0.0001 |
Table 9 . Hyper-parameter test result.
Dataset | Classification | Hyper-parameter | AUC (%) |
---|---|---|---|
Little et al. [7] | k-NN | | 98.13 |
| SVM | C: 1.0, Kernel: poly, Gamma: auto | 93.59 |
| MLP | Solver: adam, Hidden layer: 400, Learning rate: adaptive, Iterations: 3000 | 99.70 |
Naranjo et al. [14] | k-NN | | 86.03 |
| SVM | C: 0.5, Kernel: poly, Gamma: auto | 86.89 |
| MLP | Solver: adam, Hidden layer: 400, Learning rate: invscaling, Iterations: 3000 | 88.33 |
Sakar et al. [8] | k-NN | | 96.59 |
| SVM | C: 0.1, Kernel: poly, Gamma: auto | 97.72 |
| MLP | Solver: adam, Hidden layer: 400, Learning rate: invscaling, Iterations: 3000 | 99.22 |
Table 13 . Comparison with a number of previous studies Little et al. [7] dataset.
Studies | Method | Number of features | Accuracy (%) |
---|---|---|---|
Ouhmida et al. [16] | ReliefF + k-NN | 8 | 97.92 |
Thapa et al. [23] | Correlation-based + TSVM | 13 | 93.90 |
Ma et al. [24] | RFE + SVM | 11 | 96.29 |
This study | SMOTE + HSIC Lasso + k-NN | 8 | 97.61 |
| SMOTE + HSIC Lasso + SVM | 8 | 96.93 |
| SMOTE + HSIC Lasso + MLP | 8 | 98.84 |