International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(2): 140-151
Published online June 25, 2023
https://doi.org/10.5391/IJFIS.2023.23.2.140
© The Korean Institute of Intelligent Systems
Mediana Aryuni1, Suko Adiarto2, Eka Miranda1, Evaristus Didik Madyatmadja1, Albert Verasius Dian Sano3, and Elvin Sestomi1
1Department of Information Systems, Bina Nusantara University, Jakarta, Indonesia
2Department of Cardiology and Vascular Medicine, University of Indonesia, Jakarta, Indonesia
3Department of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Correspondence to: Mediana Aryuni (mediana.aryuni@binus.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the field of medical data mining, imbalanced data categorization occurs frequently, which typically leads to classifiers with low predictive accuracy for the minority class. This study aims to construct a classifier model for imbalanced data using the SMOTE oversampling algorithm and a heart disease dataset obtained from Harapan Kita Hospital. The categorization model utilized logistic regression, decision tree, random forest, bagging logistic regression, and bagging decision tree. SMOTE improved the model prediction accuracy with imbalanced data, particularly for minority classes.
Keywords: Heart disease, Prediction, Imbalanced, Accuracy, SMOTE
According to the World Health Organization (WHO), 16% of all deaths worldwide are caused by heart disease. This disease has shown the highest increase in deaths since 2000, with an increase of more than 2 million deaths, reaching 8.9 million deaths in 2019 [1]. Furthermore, according to the Institute for Health Metrics and Evaluation, the number of deaths per year caused by coronary heart disease in Indonesia is 245,343 [2].
As a result, increased prevention efforts are required, such as early detection of heart disease, which can be achieved through a doctor’s diagnosis. Alternative tools for predicting heart disease may provide solutions to this problem; for example, predictive models have been developed using machine learning. Therefore, it is crucial that heart disease prediction models be continuously improved for the purpose of developing assistive tools.
Categorization is the process of assigning a given set of data to classes. The process begins by predicting the class of the provided data points. Classes are also referred to as targets, labels, or categories.
Categorization aims to create a rule that can be used to determine the category of new data based on a dataset with a known category (training set). Several categorization algorithms were established from the attribute values assessed for each dataset [3].
In the fields of medical pattern recognition and data mining, imbalanced data categorization frequently occurs [4] and is caused by various conditions in various sample collections [5].
Imbalanced data occur when one class has significantly more observations than the others in the training data [6, 7]. When the distribution of classes is imbalanced, the categorization outcome is skewed, leading to mis-categorization [7]. The prediction accuracy for the minority data is affected by this condition [8].
Class imbalance has significant effects on the learning process, and typically leads to classifiers that have low predictive accuracy for the minority class and tend to classify the majority of new samples into the majority class. Therefore, it is crucial to evaluate the performance of classifiers [3]. Many common learning algorithms perform poorly in categorization tasks because of the class-imbalance problem [5, 6].
Numerous categorization techniques, including the feature selection algorithm, cost-sensitive learning algorithm, undersampling technique over the majority class, and oversampling strategy over the minority class, have been extensively researched for this purpose [4].
Extensive research has been conducted on categorization models for heart disease [9–14]. However, these machine learning-based prediction models for heart disease did not include methods to address imbalanced data in heart disease categorization.
Previous studies [3–8, 15] have used the Synthetic Minority Oversampling Technique (SMOTE) algorithm to solve imbalanced data problems in categorization. The study [16] demonstrated an effective use of SMOTE for heart attack prediction, training machine-learning algorithms on the UCI dataset to predict heart attacks. In [17], the SMOTE algorithm performed slightly better than the random oversampling technique in most cases.
The objective of this study is to develop a classifier model for heart disease categorization. Five models were created for this purpose: logistic regression (LR), decision tree (DT), random forest (RF), bagging logistic regression, and bagging decision tree. The dataset was obtained from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. To handle disproportionate data, which is common in the healthcare sector, we used the SMOTE algorithm to improve the minority class prediction accuracy.
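The five model families can be instantiated with scikit-learn as sketched below; this is an illustrative setup with example hyperparameters, not the authors' exact configuration.

```python
# Illustrative scikit-learn setup for the five model families used in
# this study; hyperparameters are example values, not the authors'.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    # bagging trains many bootstrap copies of a base estimator and votes
    "bagging LR": BaggingClassifier(LogisticRegression(max_iter=1000),
                                    n_estimators=10, random_state=42),
    "bagging DT": BaggingClassifier(DecisionTreeClassifier(),
                                    n_estimators=10, random_state=42),
}
```

Each model in the dictionary exposes the same `fit`/`predict` interface, which makes the with/without-SMOTE comparison in the experiments straightforward to loop over.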
This paper is organized as follows. The research background and objective are discussed in Section 1. Related research is reviewed in Section 2. Sections 3 and 4 discuss the proposed methods for categorizing heart disease. Section 5 explains the experimental results, analysis, and evaluation. Section 6 concludes the paper by outlining future research objectives.
We reviewed relevant literature for this study. Numerous machine-learning techniques have been employed to forecast heart disease, including LR [9–11], DT [9, 10], RF [9–13], and bagging [14]. Muhammad et al. [9] developed heart disease prediction methods using LR, support vector machine (SVM), k-nearest neighbors (KNN), DT, RF, and gradient boosting (GB). With the use of 6,553 patients’ electronic medical data, the study [10] built six machine-learning models (LR, RF, DT, XGBoost, Gaussian Naive Bayes, and KNN) for the prediction of asymptomatic carotid atherosclerosis in China.
The LR, RF, GB, and XGBoost algorithms were used in [11] to predict atherosclerotic cardiovascular disease (ASCVD) in Asian or Hispanic patients. Using data from the retrospective study and statistical analysis for atherosclerosis prediction, the study [12] applied a random forest for atherosclerosis in China. Based on data from a retrospective analysis of 3,302 Korean patients, one study [13] developed multiple machine-learning models (CART decision tree and RF) to predict the existence of coronary artery calcification.
The study [14] suggested using a bagging algorithm to identify heart disease warning symptoms in patients and compared the bagging algorithm’s performance with the DT method.
The dataset was collected from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. It comprises 108,191 in-patient medical records of heart disease (International Classification of Diseases [ICD] code I25.1), with two discharge categories describing how the patient left the hospital: (1) allowed to go home, referred to another hospital, or no description; and (2) died in less than 48 hours or died after more than 48 hours.
Data preparation, often known as data preprocessing, involves preparing data for further processing. This stage produces reliable data analytics for the operational data and includes quality evaluation, feature selection, data cleaning, and data transformation [18]. After preprocessing, the final stage before using the data to train the models was to balance the minority class data using SMOTE.
Figure 1 illustrates the data preprocessing steps. From the electronic health records, we accessed the heart disease medical record and laboratory test tables. The medical record table contained 65 attributes: (1) patient information, covering registration_date, return_date, patient_name, medical record code, age, room_code, laboratory test code, and doctor name; (2) ICD code and ICD code description; (3) icd_sk (International Subclassifications of Diseases); (4) 9cm_sk (International Classification of Medical Procedures), although icd_sk and 9cm_sk were removed because their columns were mostly empty; and (5) the patient’s condition and manner of leaving the hospital: at their own request, permitted to go home, referred to another hospital, no description, died in less than 48 hours, or died after more than 48 hours. Instances with empty attributes were also removed.
The next process involved combining the medical record and laboratory test tables. The laboratory test table has 58 attributes that store blood test results, but not all tests are performed for every patient; therefore, several attributes are left empty. Only the khermchc, hermch, trombosit, leukosit, hemoglobin, eritrosit, and hematokrit tests were required. After the join, we obtained a dataset with 4,691 rows.
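The join described above can be sketched with pandas; the key name `medical_record_code` and the toy values are assumptions for illustration, while the seven laboratory attributes are those named in the text.

```python
# Sketch of joining the medical-record and laboratory-test tables;
# "medical_record_code" and the toy values are illustrative assumptions.
import pandas as pd

lab_cols = ["khermchc", "hermch", "trombosit", "leukosit",
            "hemoglobin", "eritrosit", "hematokrit"]

records = pd.DataFrame({"medical_record_code": [1, 2],
                        "age": [61, 57],
                        "sex": ["M", "F"]})
labs = pd.DataFrame({"medical_record_code": [1, 2],
                     **{c: [1.0, 2.0] for c in lab_cols}})

# inner join keeps only patients present in both tables
merged = records.merge(labs, on="medical_record_code", how="inner")
merged = merged.dropna()  # drop instances with empty attributes
```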
The next step was feature selection, the process of deciding which attributes are most suitable for the analysis; it reduces bias by removing attributes that are not relevant. After screening and discussions with a medical expert, we retained nine attributes and one class label (class target) based on their relevance. Table 1 lists the descriptions of the attributes.
From the data preprocessing and analysis, we identified eight numerical attributes (age, eritrosit, hematokrit, hemoglobin, hermch, khermchc, leukosit, and trombosit) and two categorical attributes (sex and diagnosis, the class target). For the class target (diagnosis), there were two categories of how heart disease patients left the hospital: (1) allowed by the hospital to leave or referred to another hospital, and (2) the patient died in less than or more than 48 hours. The doctor determines how the patient is discharged from the hospital, and these states serve as class labels. As a result, our dataset included ground-truth labels based on the opinions of medical experts.
A model provides a mathematical representation for recognizing a certain type of pattern. We trained each model on a dataset using an algorithm that can reason over and learn from the data. In this study, we used categorization models. A model can fit the training samples almost perfectly, repeatedly scoring close to 100% accuracy on data it has already seen, yet predict poorly on data it has not observed; this is called overfitting.
To prevent this problem, a common best practice is to divide the dataset into two sets, a training set and a testing set, when training any machine-learning model, regardless of the nature of the dataset. Before splitting, however, we balanced the data using the imblearn.over_sampling.SMOTE library so that each target class had the same number of samples. After balancing, the dataset was split into four subsets: X_train, y_train, X_test, and y_test. The training set is used to estimate parameters and to compare the performance of different models, and the test set is used to examine the model [19].
The SMOTE approach uses a uniform probability distribution and KNN to produce artificial data. The operating principle is as follows. The procedure begins by dividing the provided data into majority and minority classes. The KNN technique then determines the k nearest neighbors of each minority-class sample. For each minority sample x used to create synthetic samples, one neighbor x_k is chosen at random from its k nearest neighbors. A synthetic sample is then generated as

x_new = x + λ(x_k − x),

where λ is a random value drawn from a uniform probability distribution on [0, 1].
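The interpolation step can be illustrated numerically with NumPy; the vectors below are toy values, not patient data.

```python
# One SMOTE interpolation step: the synthetic point lies on the segment
# between a minority sample x and a randomly chosen nearest neighbor x_k.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])     # minority-class sample (toy values)
x_k = np.array([3.0, 6.0])   # randomly selected nearest neighbor
lam = rng.uniform(0.0, 1.0)  # lambda drawn uniformly from [0, 1]
x_new = x + lam * (x_k - x)  # synthetic minority sample
```

Because lam lies in [0, 1], every synthetic point falls on the line segment between the two real minority samples.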
We split both the balanced and imbalanced datasets using the holdout method, with 80% of the data as the training set and the remaining 20% as the test set. Using sklearn.model_selection.train_test_split in Python, the dataset was divided into two random sets, with training sets of 4,219 rows (imbalanced dataset) and 6,750 rows (balanced dataset) and test sets of 472 rows (imbalanced) and 1,688 rows (balanced). We also encoded the categorical attributes: gender was converted to ‘1’ for male and ‘0’ for female, and the class label to ‘1’ for a patient who died in hospital within or after 48 hours, and ‘0’ for a patient who went home, was discharged at their own request, finished observation, was transferred to another hospital, or had no description. To find the most suitable models for the dataset, we experimented with LR, DT, RF, and bagging LR and DT. A flowchart of the model development is shown in Figure 2.
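The holdout split and the binary encodings described above can be sketched as follows; `df` is a toy stand-in for the preprocessed dataset, and its column names are assumptions.

```python
# 80/20 holdout split plus binary encoding of gender and the class label;
# "df" and its column names are illustrative stand-ins.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "gender": (["male", "female"] * 50),
    "age": range(100),
    "diagnosis": (["died"] * 20 + ["discharged"] * 80),
})
df["gender"] = (df["gender"] == "male").astype(int)        # 1 = male, 0 = female
df["diagnosis"] = (df["diagnosis"] == "died").astype(int)  # 1 = died in hospital

X = df.drop(columns="diagnosis")
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```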
The models were then evaluated using a variety of metrics. The confusion matrix, obtained by importing sklearn.metrics.confusion_matrix, measures the models based on true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the classification report, obtained by importing sklearn.metrics.classification_report, consists of the accuracy score (TP plus TN divided by the total of TP, TN, FP, and FN), precision, recall, and F1-score. Because the base data were imbalanced before being balanced with SMOTE, we did not rely on accuracy; instead, we used metrics better suited to imbalanced data, namely precision, recall, and F1-score. Precision is calculated by dividing TP by the sum of TP and FP. Recall is calculated by dividing TP by the sum of TP and FN. The F1-score is the harmonic mean of precision and recall [20, 21].
As stated above, precision and recall measure the quality of a model; the highest value of these metrics is 1.0 and the lowest is 0.0, and the higher the F1-score, the better the model. In this categorization model, an FN is particularly undesirable because it could cause a patient to fail to receive proper treatment and endanger the patient’s life. As the formulas show, precision measures the extent of error caused by FPs, whereas recall measures the extent of error caused by FNs, which makes recall the more important metric in this case. Therefore, the recall score was monitored to control unwanted false-negative predictions. Once recall is near 1, we can turn to precision and use the F1-score to improve both scores together.
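The metric definitions above can be checked against a small worked example with toy confusion-matrix counts (not taken from the study's results):

```python
# Toy confusion-matrix counts, chosen only to illustrate the formulas.
tp, tn, fp, fn = 40, 30, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 70 / 100 = 0.7
precision = tp / (tp + fp)                          # 40 / 50  = 0.8
recall = tp / (tp + fn)                             # 40 / 60  ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.727
```

Note how a high precision can coexist with a mediocre recall; monitoring recall directly is what catches the costly false negatives.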
We also used the ROC AUC curve [22], which is considered more convenient because it employs prediction probabilities, which the F1-score cannot use. The AUC-ROC is a metric used to measure performance: ROC is a probability curve, and AUC represents the degree of separability. Figure 3 shows an example of an ROC AUC curve.
The AUC-ROC measures how well a model can distinguish between classes. When the AUC is above 0.5, the model has some ability to separate the classes, and this ability increases as the AUC approaches 1, at which point the model predicts class 1 as 1 and class 0 as 0. Conversely, an AUC near 0 means the model inverts the classes, predicting class 1 as 0 and class 0 as 1, and an AUC of 0.5 means the model cannot distinguish between the positive and negative classes at all. A good model therefore has an AUC close to 1, and a poor model has an AUC near 0.5 or near 0. The AUC can be increased by defining the weight of each class when tuning the model’s hyperparameters or by balancing the dataset. The yellowbrick.classifier.rocauc.roc_auc and matplotlib.pyplot libraries were used to calculate and display the curve [18].
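A minimal AUC computation with scikit-learn; the labels and scores below are hypothetical predicted probabilities, not outputs of the study's models.

```python
# AUC from hypothetical predicted probabilities for the positive class;
# 0.5 would mean no separability, 1.0 perfect separability.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted P(class = 1)
auc = roc_auc_score(y_true, y_score)  # 0.75 here: better than chance
```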
For the experiments on the dataset balanced with the SMOTE algorithm, the class_weight hyperparameter was additionally set to “balanced” in each model fitted to the SMOTE-balanced dataset [23]. Under this setting, the weight of each class c is defined as

w_c = n_samples / (n_classes × n_c),

where n_samples is the total number of training samples, n_classes is the number of classes, and n_c is the number of samples in class c.
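scikit-learn exposes the same heuristic through compute_class_weight, which shows the weights a model with class_weight="balanced" would apply; the label counts below are toy values.

```python
# Weights implied by class_weight="balanced": n_samples / (n_classes * n_c).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # toy imbalanced labels
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# weights[0] = 100 / (2 * 90) ~ 0.556, weights[1] = 100 / (2 * 10) = 5.0
```

The minority class thus receives a proportionally larger weight, which penalizes the model more for misclassifying minority samples.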
The experiment is described below.
From the experiments we conducted, we evaluated and compared the performance of each model. Tables 2 and 3 show the performance of each model without and with SMOTE, respectively.
In most measurements, the SMOTE algorithm enhanced model performance, except for the precision of bagging LR and bagging DT and the AUC of bagging DT. The precision, recall, F1-score, and AUC of LR, DT, and RF with the SMOTE algorithm were slightly better than without it. The recall, F1-score, and AUC of bagging LR improved notably when applying the SMOTE algorithm. For bagging DT, the SMOTE algorithm enhanced the recall and F1-score; bagging DT achieved the best precision and AUC without the SMOTE algorithm and the best recall and F1-score with it.
Based on [24–26], SMOTE enhances the recall value. Our five classifier models also attained improved recall by applying the SMOTE algorithm. With improved recall, the number of false negatives decreased.
Figure 4 shows the ROC curves of the five classifier models without the SMOTE algorithm. The ROC curves of the five classifier models obtained using the SMOTE algorithm are shown in Figure 5. The bagging DT outperformed the other models both with and without the SMOTE algorithm.
Table 4 lists the performance of each model before and after the SMOTE algorithm was used to predict the minority class. SMOTE significantly enhanced the precision and F1-score of all the models.
This is the first study to present a systematic comparison of oversampling algorithms for heart disease categorization using data from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. In comparison with [16], we obtained results grounded in real clinical data because we used data directly from a hospital.
Previous research has confirmed our findings, indicating that the SMOTE approach for learning from imbalanced data has proven successful in a wide range of applications across multiple domains. SMOTE inspired several approaches to address the issue of class imbalance and made significant contributions to supervised learning paradigms [6]. In the medical field, classification of imbalanced data using the SMOTE algorithm attains superior performance in the automatic computerized detection of pulmonary disease on several chest X-ray image databases [4]. In addition, research on the expansion and classification of imbalanced data based on the SMOTE algorithm, using the imbalanced Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-Wisconsin datasets, shows improved classification performance [5]. Compared with [4] and [5], our research improves the process by comparing experiments with and without SMOTE for each algorithm: LR, DT, RF, bagging LR, and bagging DT. As a result, we were able to investigate the effect of SMOTE on each algorithm.
Finally, this study contributes to the ability of machine-learning-based models to categorize imbalanced data using SMOTE, thereby improving model prediction accuracy, particularly for minority classes. This result contributes to the medical field where imbalanced data are frequently encountered.
This study developed categorization models using SMOTE for predicting heart disease from Harapan Kita Hospital’s electronic health records. These models, namely LR, DT, RF, bagging LR, and bagging DT, improved prediction accuracy, particularly for the minority classes. For future research, we intend to explore ensemble methods and deep learning to optimize the prediction accuracy.
No potential conflicts of interest relevant to this article were reported.
Table 1. Attribute description.
Attribute name | Description
---|---
sex | male or female
age | age in years
eritrosit | the number of red blood cells in the patient’s blood vessels
hematokrit | percentage of red blood cells in blood
hemoglobin | the amount of oxygen- and iron-carrying protein in the patient’s body
hermch | the average mass of hemoglobin per red blood cell in a blood sample from a patient
khermchc | the patient’s average hemoglobin concentration per red blood cell
leukosit | white blood cell count in the patient’s blood vessels
trombosit | the number of platelets in the patient’s blood for blood clotting
diagnosis | class label (class target)
Table 2. Performance of each model without SMOTE algorithm.
Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
---|---|---|---|---
LR | 47 | 50 | 49 | 64
DT | 70 | 60 | 63 | 62
RF | 71 | 58 | 61 | 70
Bagging LR | 72 | 51 | 50 | 69
Bagging DT | 98 | 66 | 73 | 97
Table 3. Performance of each model with SMOTE algorithm.
Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
---|---|---|---|---
LR | 66 | 66 | 66 | 72
DT | 76 | 75 | 75 | 82
RF | 78 | 78 | 78 | 87
Bagging LR | 66 | 66 | 66 | 73
Bagging DT | 83 | 83 | 83 | 92
Table 4. Performance of each model before and after using SMOTE algorithm in predicting minority class.
Model | Precision before (%) | Precision after (%) | Recall before (%) | Recall after (%) | F1-score before (%) | F1-score after (%)
---|---|---|---|---|---|---
LR | 9 | 66 | 67 | 67 | 15 | 66
DT | 10 | 70 | 81 | 84 | 18 | 76
RF | 14 | 75 | 82 | 81 | 24 | 78
Bagging LR | 9 | 66 | 67 | 67 | 15 | 66
Bagging DT | 19 | 81 | 76 | 76 | 30 | 78
International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(2): 140-151
Published online June 25, 2023 https://doi.org/10.5391/IJFIS.2023.23.2.140
Copyright © The Korean Institute of Intelligent Systems.
Mediana Aryuni1 , Suko Adiarto2 , Eka Miranda1 , Evaristus Didik Madyatmadja1 , Albert Verasius Dian Sano3 , and Elvin Sestomi1
1Department of Information Systems, Bina Nusantara University, Jakarta, Indonesia
2Department of Cardiology and Vascular Medicine, University of Indonesia, Jakarta, Indonesia
3Department of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Correspondence to:Mediana Aryuni (mediana.aryuni@binus.ac.id)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the field of medical data mining, imbalanced data categorization occurs frequently, which typically leads to classifiers with low predictive accuracy for the minority class. This study aims to construct a classifier model for imbalanced data using the SMOTE oversampling algorithm and a heart disease dataset obtained from Harapan Kita Hospital. The categorization model utilized logistic regression, decision tree, random forest, bagging logistic regression, and bagging decision tree. SMOTE improved the model prediction accuracy with imbalanced data, particularly for minority classes.
Keywords: Heart disease, Prediction, Imbalanced, Accuracy, SMOTE
According to the World Health Organization (WHO), 16% of all deaths worldwide are caused by heart disease. This disease has shown the highest increase in deaths since 2000, with a growth of more than 2 million deaths to 8.9 million deaths in 2019 [1]. Furthermore, according to the Institute for Health Metrics and Evaluation, the number of deaths per year caused by coronary heart disease in Indonesia is 245,343 [2].
As a result, increased prevention efforts are required, such as early detection of heart disease, which can be achieved through a doctor’s diagnosis. Alternative tools for predicting heart disease may provide solutions to these problems. For example, a predictive model was developed using machine learning-based predictive models. Therefore, it is crucial that heart disease prediction models are continuously improved for the purpose of developing assistive devices.
Categorization is the process of categorizing a specific set of data into classes. The process begins by predicting the class based on the provided data points. Classes are also referred to as targets, labels, or categories.
Categorization aims to create a rule that can be used to determine the category of new data based on a dataset with a known category (training set). Several categorization algorithms were established from the attribute values assessed for each dataset [3].
In the fields of medical pattern recognition and data mining, imbalanced data categorization frequently occurs [4] and is caused by various conditions in various sample collections [5].
Imbalanced data occurs when there are significantly more observations in one class than in the training data [6, 7]. When the distribution of classes is imbalanced, the categorization outcome is skewed, leading to mis-categorization [7]. The prediction accuracy of the minority data is affected by this condition [8].
Class imbalance has significant effects on the learning process, and typically leads to classifiers that have low predictive accuracy for the minority class and tend to classify the majority of new samples into the majority class. Therefore, it is crucial to evaluate the performance of classifiers [3]. Many common learning algorithms perform poorly in categorization tasks because of the class-imbalance problem [5, 6].
Numerous categorization techniques, including the feature selection algorithm, cost-sensitive learning algorithm, undersampling technique over the majority class, and oversampling strategy over the minority class, have been extensively researched for this purpose [4].
Extensive research has been conducted on categorization models for heart disease [9–14]. Their machine learning-based prediction model for heart disease did not include methods to address imbalanced data in heart disease categorization.
Previous studies [3–8, 15] have used the Synthetic Minority Oversampling Technique (SMOTE) algorithm to solve imbalanced data problems in categorization. The study [16] demonstrated an effective SMOTE for heart attack prediction. They used the UCI dataset to train machine-learning algorithms to predict heart attacks. In this study [17], the SMOTE algorithm performs slightly better than the random oversampling technique in most cases.
The objective of this study is to develop a classifier model for heart disease categorization. Five models were created for this purpose: logistic regression (LR), decision tree (DT), random forest (RF), bagging logistic regression, and bagging decision tree. The dataset was obtained from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. To handle disproportionate data, which is common in the healthcare sector, we used the SMOTE algorithm to improve the minority class prediction accuracy.
This paper is organized as follows. The research background and objective are discussed in Section 1. Related research is listed in Section 2. In Sections 3 and 4 discusses the proposed methods for categorizing heart disease. Section 5 explains the experimental results, analysis, and evaluation. Section 6 concludes the report by outlining the future research objectives.
We reviewed relevant literature for this study. Numerous machine-learning techniques have been employed to forecast heart disease, including LR [9–11], DT [9, 10], RF [9–13], and bagging [14]. Muhammad et al. [9] developed heart disease prediction methods using LR, support vector machine (SVM), k-nearest neighbors (KNN), DT, RF, and gradient boosting (GB). With the use of 6,553 patients’ electronic medical data, the study [10] built six machine-learning models (LR, RF, DT, XGBoost, Gaussian Naive Bayes, and KNN) for the prediction of asymptomatic carotid atherosclerosis in China.
The LR, RF, GB, and XGBoost algorithms were used in [11] to predict atherosclerotic cardiovascular disease (ASCVD) in Asian or Hispanic patients. Using data from the retrospective study and statistical analysis for atherosclerosis prediction, the study [12] applied a random forest for atherosclerosis in China. Based on data from a retrospective analysis of 3,302 Korean patients, one study [13] developed multiple machine-learning models (CART decision tree and RF) to predict the existence of coronary artery calcification.
The study [14] suggested using a bagging algorithm to identify heart disease warning symptoms in patients and compared the bagging algorithm’s performance with the DT method.
The dataset was collected from the Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. It comprises 108,191 in-patient medical records of heart disease (International Classification of Diseases [ICD] code I25.1), with two categories of diagnosis of how the patient left the hospital: (1) allowed to go home or referred to another hospital or no description and (2) died in less than 48 hours or died more than 48 hours.
Data preparation, often known as data preprocessing, involves preparing data for further processing. As we are aware, data preparation includes preprocessing. This stage produces reliable data analytics for the operational data. These phases include quality evaluation, feature selection, data cleaning, and data transformation [18]. After data preprocessing, the final stage, before using the data to train the model, balanced the minority class data using SMOTE.
Figure 1 illustrates the data preprocessing steps. From the electronic health records, we accessed the heart disease medical record and laboratory test tables. There were 65 attributes of medical records in the table: (1) patient information, covers: registration_date, return_date, patient_name, medical record code, age, room_code and laboratory test code; doctor name; (2) icd code and icd code description; (3) icd_sk (International Subclassifications of Diseases); (4) 9 cm_sk (International Classification of Medical Procedures), but icd_sk and 9 cm_sk were removed due to the column which is mostly empty; (5) the patient’s condition when they leave the hospital and how they leave: is it at their own request, are they permitted to go home, are they referred to another hospital, or no description; and they died in less than 48 hours or more than 48 hours. Instances with empty attributes were also removed.
The next process involved combining the medical records and laboratory test tables. The laboratory test table has 58 attributes that store blood test results, but not all tests are performed; therefore, several attributes are left empty. The laboratory test only requires testing for khermchc, hermch, trombosit, leukosit, hemoglobin, eritrosit, and hematokrit. We obtained a dataset with 4,691 rows.
The next step was the feature selection process, which is a process of deciding which attributes are most suitable for the analysis to reduce bias when using the data by removing attributes that are not relevant. After screening and discussions with a medical expert, we used nine attributes and one class label (class target) based on a relevant test. Table 1 lists the descriptions of the attributes.
From the data preprocessing and data analysis, we discovered eight numerical attributes, including age, eritrosit, hematokrit, hemoglobin, hermch, khermchc, leukosit, and trombosit, and two categorical attributes, sex and diagnosis (class target). For the class target(diagnosis), there were two categories of how heart disease patients left the hospital: (1) allowed by the hospital to leave or referred to another hospital and (2) the patient died less or more than 48 hours. The doctor determines how the patient should be discharged from the hospital and uses these states as class labels. As a result, our dataset included ground-truth labels based on the opinions of medical experts.
This model provides a mathematical representation that recognizes a certain type of pattern. We trained the model on a dataset using an algorithm that can reason and learn from the data. In this study, we used a categorization model. A model can repeatedly achieve an excellent score, even 100% accuracy, on the samples it was trained on, yet fail to predict well on data that have not been observed; this is called overfitting.
To prevent this problem, a common best practice is to divide the dataset into training and testing sets before training any machine-learning model, regardless of the nature of the dataset. Before splitting, we balanced the data with the imblearn.over_sampling.SMOTE library so that each target class was equally represented. After the data were balanced, the dataset was split into four parts: X_train, Y_train, X_test, and Y_test. The training set is used to estimate parameters and to compare model performance, and the test set is used to examine the model [19].
The SMOTE approach uses a uniform probability distribution and K-nearest neighbors (KNN) to produce artificial data. The operating principle is as follows. The procedure begins by dividing the given data into majority and minority classes. The KNN technique then determines the k nearest neighbors of each minority-class sample. For each minority sample x_i, one of its k nearest neighbors, x_nn, is chosen at random, and a synthetic sample is created as

x_new = x_i + λ (x_nn − x_i),

where λ is a random value drawn from the uniform probability distribution on [0, 1].
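SMOTE's interpolation between a minority sample and one of its randomly chosen nearest neighbors can be reproduced in a few lines of NumPy; the points x_i and x_nn below are illustrative, not values from the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

x_i = np.array([2.0, 4.0])    # a minority-class sample
x_nn = np.array([4.0, 8.0])   # one of its k nearest neighbors, chosen at random

lam = rng.uniform(0.0, 1.0)           # lambda drawn uniformly from [0, 1]
x_new = x_i + lam * (x_nn - x_i)      # synthetic sample

# The synthetic point always lies on the segment between the sample
# and its neighbor.
print(x_new)
```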
We split both the balanced and imbalanced datasets using the holdout method, with 80% of the data as the training set and the remaining 20% as the test set. The dataset was divided into two random subsets using sklearn.model_selection.train_test_split in Python, with the training set accounting for 80% of the imbalanced dataset (4,219 rows) and the balanced dataset (6,750 rows), and the test set accounting for 20% of the imbalanced (472 rows) and balanced (1,688 rows) datasets. We also encoded the categorical attributes: gender was converted to '1' for male and '0' for female, and the class label was set to 1 for a patient who died in the hospital (within 48 hours or after 48 hours) and 0 for a patient who went home, was discharged at their own request, finished observation, was transferred to another hospital, or had no description. To find the most suitable models for the dataset, we experimented with LR, DT, RF, bagging LR, and bagging DT. A flowchart of the model development is shown in Figure 2.
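The holdout split and the categorical encoding can be sketched as follows. The data here are randomly generated; only the 80/20 proportions and the encoding rules mirror the study:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 9))                       # nine predictor attributes
gender = rng.choice(["male", "female"], size=n)
died = rng.choice([True, False], size=n)

# Encode categorical values: male -> 1, female -> 0;
# died in hospital -> 1, discharged / referred / no description -> 0.
gender_encoded = (gender == "male").astype(int)
y = died.astype(int)

# Holdout split: 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```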
The models were then evaluated using a variety of metrics. The confusion matrix measures the models based on the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), computed by importing the sklearn.metrics.confusion_matrix library; the classification report consists of the accuracy score (TP plus TN divided by the total of TP, TN, FP, and FN), precision, recall, and F1-score, computed by importing sklearn.metrics.classification_report. Because the base data were imbalanced and then balanced with SMOTE, we decided not to rely on accuracy alone; instead, we used metrics that are better at examining models on imbalanced data: precision, recall, and F1-score. Precision is calculated by dividing the TPs by the total of TPs and FPs. Recall is calculated by dividing the TPs by the total of TPs and FNs. The F1-score combines precision and recall as their harmonic mean [20, 21].
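The metric formulas above can be checked directly in plain Python; the confusion-matrix counts below are invented for illustration:

```python
# Hypothetical confusion-matrix counts, not results from the study.
TP, TN, FP, FN = 80, 90, 20, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # correct / all predictions
precision = TP / (TP + FP)                          # TP / (TP + FP)
recall = TP / (TP + FN)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
```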
As stated above, the F1-score reflects the quality of both the recall and precision of a model; its highest value is 1.0 and its lowest is 0.0, and the higher the F1-score, the better the model. In this categorization model, an undesirable FN could cause a patient to fail to receive proper treatment and endanger the patient's life. As the formulas show, the precision score measures the extent of error caused by FPs, whereas recall measures the extent of error caused by FNs, which is more important in this case. Therefore, the recall score was monitored to handle unwanted false-negative predictions. Once the recall score is near 1, we can turn to precision and use the F1-score to improve both recall and precision.
We also used the ROC AUC curve [22], which is considered more convenient because it employs the prediction probabilities, which the F1-score cannot do. The AUC-ROC is a metric used to measure performance: the ROC is a probability curve, and the AUC represents the degree of separability. Figure 3 shows an example of an ROC AUC curve.
The AUC-ROC measures how well a model can distinguish between the classes. When the AUC is above 0.5, the model is increasingly able to separate the classes as the value approaches 1, at which point the model predicts class 1 as 1 and class 0 as 0. Conversely, as the AUC approaches 0, the model inverts the classes, predicting class 1 as 0 and class 0 as 1. When the AUC is 0.5, the model cannot distinguish between the positive and negative classes at all. A good model has an AUC well above 0.5, and a poor model has an AUC near 0.5 or near 0. The ROC AUC can be increased by defining the weight of each class when tuning the hyperparameters of the model or by balancing the dataset. The yellowbrick.classifier.rocauc.roc_auc library and matplotlib.pyplot were used to calculate and display the curve [18].
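The degree-of-separability interpretation can be illustrated without any library: the AUC equals the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The score sets below are invented for illustration; in the study the curves were produced with yellowbrick and matplotlib:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC = P(score of a positive > score of a negative); ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfectly separated scores give AUC = 1.0, inverted scores give 0.0,
# and indistinguishable scores give 0.5, matching the text above.
print(auc_from_scores([0.9, 0.8], [0.2, 0.1]))  # 1.0
print(auc_from_scores([0.2, 0.1], [0.9, 0.8]))  # 0.0
print(auc_from_scores([0.5, 0.5], [0.5, 0.5]))  # 0.5
```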
For the experiment with the dataset balanced by the SMOTE library, the class_weight hyperparameter was additionally set to "balanced" in each model fitted on the SMOTE-balanced dataset [23]. With this setting, the weight of each class is defined by

w_c = n_samples / (n_classes × n_c),

where n_samples is the total number of samples, n_classes is the number of classes, and n_c is the number of samples in class c.
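Under scikit-learn's "balanced" setting, the per-class weights follow the documented formula n_samples / (n_classes * np.bincount(y)) and can be computed by hand; the label counts below are hypothetical:

```python
from collections import Counter

# Hypothetical label counts: 900 survivors (class 0), 100 deaths (class 1).
y = [0] * 900 + [1] * 100

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# class_weight="balanced": w_c = n_samples / (n_classes * count_c)
weights = {c: n_samples / (n_classes * counts[c]) for c in counts}
print(weights)  # the minority class receives the larger weight
```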
The experiment is described below.
From the experiments conducted, we evaluated and compared the performance of each model. Tables 2 and 3 show the performance of each model with and without the SMOTE algorithm.
In most measurements, the SMOTE algorithm enhanced model performance, except for the precision of bagging LR and bagging DT and the AUC of bagging DT. The precision, recall, F1-score, and AUC of LR, DT, and RF with the SMOTE algorithm were slightly better than those without it. The recall, F1-score, and AUC of bagging LR improved significantly when the SMOTE algorithm was applied. For bagging DT, the SMOTE algorithm enhanced the recall and F1-score. Bagging DT achieved the best precision and AUC without the SMOTE algorithm and the best recall and F1-score with it.
Based on [24–26], SMOTE enhances the recall value. Our five classifier models also attained a recall enhancement by applying the SMOTE algorithm. With the improvement in recall, the number of false negatives decreased.
Figure 4 shows the ROC curves of the five classifier models without the SMOTE algorithm. The ROC curves of the five classifier models obtained using the SMOTE algorithm are shown in Figure 5. Bagging DT surpassed the other models both with and without the SMOTE algorithm.
Table 4 lists the performance of each model before and after the SMOTE algorithm was used to predict the minority class. SMOTE significantly enhanced the precision and F1-score of all the models.
This is the first study to present a systematic comparison of oversampling algorithms for heart disease categorization using data from Harapan Kita Hospital's electronic health records in Jakarta, Indonesia. In comparison with [16], we obtained factual results because we used data directly from hospitals.
Previous research confirms our findings, indicating that the SMOTE approach for learning from imbalanced data has proven successful in a wide range of applications across multiple domains. SMOTE inspired several approaches to address the issue of class imbalance and made significant contributions to supervised learning paradigms [6]. In the medical field, classification of imbalanced data using the SMOTE algorithm attained superior performance on several chest X-ray image databases in the automatic computerized detection of pulmonary disease [4]. In addition, research on the expansion and classification of imbalanced data based on the SMOTE algorithm showed a better classification effect on the imbalanced Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-Wisconsin datasets [5]. Compared with [4] and [5], our research extends the process by comparing experiments with and without SMOTE for each algorithm: LR, DT, RF, bagging LR, and bagging DT. As a result, we were able to investigate the effect of SMOTE on each algorithm.
Finally, this study contributes to the ability of machine-learning-based models to categorize imbalanced data using SMOTE, thereby improving model prediction accuracy, particularly for minority classes. This result contributes to the medical field where imbalanced data are frequently encountered.
This study used categorization models with SMOTE to predict heart disease from Harapan Kita Hospital's electronic health records. These models, LR, DT, RF, bagging LR, and bagging DT, improved the model prediction accuracy, particularly for minority classes. For future research, we intend to explore ensemble methods and deep learning to further optimize the prediction accuracy.
Data preprocessing.
Model development.
ROC AUC curve example.
ROC curves of five classifier models without SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.
ROC curves of five classifier models with SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.
Table 1. Attribute description.

Attribute name | Description
---|---
sex | male or female
age | age in years
eritrosit | the number of red blood cells in the patient's blood vessels
hematokrit | percentage of red blood cells in blood
hemoglobin | the amount of oxygen- and iron-carrying protein in the patient's body
hermch | the average mass of hemoglobin per red blood cell in a blood sample from the patient
khermchc | the patient's average hemoglobin concentration per red blood cell
leukosit | white blood cell number in the patient's blood vessels
trombosit | the number of platelets in the patient's blood for blood clotting
diagnosis | class label (class target)
Table 2. Performance of each model without SMOTE algorithm.

Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
---|---|---|---|---
LR | 47 | 50 | 49 | 64
DT | 70 | 60 | 63 | 62
RF | 71 | 58 | 61 | 70
Bagging LR | 72 | 51 | 50 | 69
Bagging DT | 98 | 66 | 73 | 97
Table 3. Performance of each model with SMOTE algorithm.

Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
---|---|---|---|---
LR | 66 | 66 | 66 | 72
DT | 76 | 75 | 75 | 82
RF | 78 | 78 | 78 | 87
Bagging LR | 66 | 66 | 66 | 73
Bagging DT | 83 | 83 | 83 | 92
Table 4. Performance of each model before and after using SMOTE algorithm in predicting minority class.

Model | Precision before (%) | Precision after (%) | Recall before (%) | Recall after (%) | F1-score before (%) | F1-score after (%)
---|---|---|---|---|---|---
LR | 9 | 66 | 67 | 67 | 15 | 66
DT | 10 | 70 | 81 | 84 | 18 | 76
RF | 14 | 75 | 82 | 81 | 24 | 78
Bagging LR | 9 | 66 | 67 | 67 | 15 | 66
Bagging DT | 19 | 81 | 76 | 76 | 30 | 78