Original Article
International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(2): 140-151

Published online June 25, 2023

https://doi.org/10.5391/IJFIS.2023.23.2.140

© The Korean Institute of Intelligent Systems

Imbalanced Learning in Heart Disease Categorization: Improving Minority Class Prediction Accuracy Using the SMOTE Algorithm

Mediana Aryuni1 , Suko Adiarto2 , Eka Miranda1 , Evaristus Didik Madyatmadja1 , Albert Verasius Dian Sano3 , and Elvin Sestomi1

1Department of Information Systems, Bina Nusantara University, Jakarta, Indonesia
2Department of Cardiology and Vascular Medicine, University of Indonesia, Jakarta, Indonesia
3Department of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Correspondence to :
Mediana Aryuni (mediana.aryuni@binus.ac.id)

Received: December 5, 2022; Revised: April 6, 2023; Accepted: May 22, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In the field of medical data mining, imbalanced data categorization occurs frequently, which typically leads to classifiers with low predictive accuracy for the minority class. This study aims to construct a classifier model for imbalanced data using the SMOTE oversampling algorithm and a heart disease dataset obtained from Harapan Kita Hospital. The categorization model utilized logistic regression, decision tree, random forest, bagging logistic regression, and bagging decision tree. SMOTE improved the model prediction accuracy with imbalanced data, particularly for minority classes.

Keywords: Heart disease, Prediction, Imbalanced, Accuracy, SMOTE

1. Introduction

According to the World Health Organization (WHO), 16% of all deaths worldwide are caused by heart disease. Heart disease has shown the largest increase in deaths of any cause since 2000, rising by more than 2 million to 8.9 million deaths in 2019 [1]. Furthermore, according to the Institute for Health Metrics and Evaluation, coronary heart disease causes 245,343 deaths per year in Indonesia [2].

As a result, increased prevention efforts are required, such as early detection of heart disease through a doctor’s diagnosis. Alternative tools for predicting heart disease, such as machine learning-based predictive models, may support these efforts. It is therefore important to continuously improve heart disease prediction models so that they can serve as assistive tools.

Categorization (classification) is the process of assigning data to predefined classes, also referred to as targets, labels, or categories, by predicting the class of each data point.

Categorization aims to create a rule that can be used to determine the category of new data based on a dataset with known categories (the training set). Several categorization algorithms build this rule from the attribute values of each dataset [3].

In the fields of medical pattern recognition and data mining, imbalanced data categorization frequently occurs [4] and is caused by various conditions in various sample collections [5].

Imbalanced data occur when one class has significantly more observations than the others in the training data [6, 7]. When the class distribution is imbalanced, the categorization outcome is skewed toward the majority class, leading to mis-categorization [7]. This condition lowers the prediction accuracy for the minority class [8].

Class imbalance has significant effects on the learning process, and typically leads to classifiers that have low predictive accuracy for the minority class and tend to classify the majority of new samples into the majority class. Therefore, it is crucial to evaluate the performance of classifiers [3]. Many common learning algorithms perform poorly in categorization tasks because of the class-imbalance problem [5, 6].

Numerous techniques, including feature selection, cost-sensitive learning, undersampling of the majority class, and oversampling of the minority class, have been extensively researched for this purpose [4].

Extensive research has been conducted on categorization models for heart disease [9–14]. However, these machine learning-based heart disease prediction models did not include methods to address imbalanced data.

Previous studies [3–8, 15] have used the Synthetic Minority Oversampling Technique (SMOTE) algorithm to solve imbalanced data problems in categorization. The study [16] demonstrated an effective SMOTE-based model for heart attack prediction, training machine-learning algorithms on the UCI dataset. In [17], the SMOTE algorithm performed slightly better than random oversampling in most cases.

The objective of this study is to develop a classifier model for heart disease categorization. Five models were created for this purpose: logistic regression (LR), decision tree (DT), random forest (RF), bagging logistic regression, and bagging decision tree. The dataset was obtained from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. To handle disproportionate data, which is common in the healthcare sector, we used the SMOTE algorithm to improve the minority class prediction accuracy.

This paper is organized as follows. The research background and objective are discussed in Section 1. Related research is reviewed in Section 2. Sections 3 and 4 describe the dataset and the proposed methods for categorizing heart disease. Section 5 presents the experimental results, analysis, and evaluation. Section 6 concludes the paper and outlines future research objectives.

2. Related Works

We reviewed relevant literature for this study. Numerous machine-learning techniques have been employed to forecast heart disease, including LR [9–11], DT [9, 10], RF [9–13], and bagging [14]. Muhammad et al. [9] developed heart disease prediction methods using LR, support vector machine (SVM), k-nearest neighbors (KNN), DT, RF, and gradient boosting (GB). Using the electronic medical records of 6,553 patients, the study [10] built six machine-learning models (LR, RF, DT, XGBoost, Gaussian naive Bayes, and KNN) to predict asymptomatic carotid atherosclerosis in China.

The LR, RF, GB, and XGBoost algorithms were used in [11] to predict atherosclerotic cardiovascular disease (ASCVD) in Asian and Hispanic patients. Using retrospective data and statistical analysis, the study [12] applied a random forest to predict atherosclerosis in China. Based on a retrospective analysis of 3,302 Korean patients, another study [13] developed multiple machine-learning models (CART decision tree and RF) to predict the presence of coronary artery calcification.

The study [14] suggested using a bagging algorithm to identify heart disease warning symptoms in patients and compared the bagging algorithm’s performance with the DT method.

3. Dataset

The dataset was collected from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. It comprises 108,191 in-patient medical records of heart disease (International Classification of Diseases [ICD] code I25.1) with two discharge categories: (1) allowed to go home, referred to another hospital, or no description, and (2) died in less than 48 hours or died after more than 48 hours.

4. Methodology

4.1 Preprocessing

Data preprocessing, often called data preparation, makes the raw operational data ready for further processing and reliable analytics. Its phases include quality evaluation, feature selection, data cleaning, and data transformation [18]. After preprocessing, and before the data were used to train the models, the minority class was balanced using SMOTE.

Figure 1 illustrates the data preprocessing steps. From the electronic health records, we accessed the heart disease medical record and laboratory test tables. The medical record table had 65 attributes: (1) patient information, covering registration_date, return_date, patient_name, medical record code, age, room_code, laboratory test code, and doctor name; (2) ICD code and ICD code description; (3) icd_sk (International Subclassifications of Diseases); (4) 9cm_sk (International Classification of Medical Procedures), although icd_sk and 9cm_sk were removed because these columns were mostly empty; and (5) the patient’s condition and manner of leaving the hospital: discharged at their own request, permitted to go home, referred to another hospital, no description, died in less than 48 hours, or died after more than 48 hours. Instances with empty attributes were also removed.

The next process involved combining the medical record and laboratory test tables. The laboratory test table has 58 attributes that store blood test results, but not all tests are performed, so several attributes are left empty. Only the khermchc, hermch, trombosit, leukosit, hemoglobin, eritrosit, and hematokrit tests were required. We obtained a dataset with 4,691 rows.
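
To make this joining step concrete, the following is a minimal pandas sketch of how the two tables could be merged and incomplete rows dropped. The file and column names are hypothetical, since the hospital’s actual schema is not published.

```python
import pandas as pd

# Hypothetical exports of the two tables; file and column names are assumptions.
medical = pd.read_csv("medical_records.csv")
lab = pd.read_csv("laboratory_tests.csv")

# Keep only the blood tests named in the text (assumed column names).
lab = lab[["lab_test_code", "khermchc", "hermch", "trombosit",
           "leukosit", "hemoglobin", "eritrosit", "hematokrit"]]

# Join on the laboratory test code and drop instances with empty attributes,
# mirroring the cleaning described above.
merged = medical.merge(lab, on="lab_test_code", how="inner").dropna()
print(merged.shape)  # the paper reports 4,691 remaining rows
```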

The next step was feature selection, which decides which attributes are most suitable for the analysis and reduces bias by removing irrelevant attributes. After screening and discussions with a medical expert, we retained nine attributes and one class label (class target) based on relevance. Table 1 lists the attribute descriptions.

From the data preprocessing and data analysis, we obtained eight numerical attributes (age, eritrosit, hematokrit, hemoglobin, hermch, khermchc, leukosit, and trombosit) and two categorical attributes, gender and diagnosis (the class target). For the class target (diagnosis), there were two categories describing how heart disease patients left the hospital: (1) allowed by the hospital to leave or referred to another hospital and (2) died in hospital, whether within or after 48 hours. The doctor determines how the patient is discharged from the hospital, and these states were used as class labels. As a result, our dataset included ground-truth labels based on the opinions of medical experts.

4.2 Model Development

A model provides a mathematical representation for recognizing a certain type of pattern. We trained each model on a dataset using an algorithm that can reason and learn from the data. In this study, we used categorization models. A model can label the training samples with near-perfect accuracy yet still fail to predict unseen data; this is called overfitting.

To prevent this problem, a common best practice is to divide the dataset into training and test sets before training any machine-learning model, regardless of the nature of the dataset. Before splitting, we balanced the dataset for each target class using the imblearn.over_sampling.SMOTE library. After the data were balanced, the dataset was split into training and test sets (X_train, y_train, X_test, and y_test). The training set is used to estimate the model parameters and to compare model performance, and the test set is used to examine the model [19].

The SMOTE approach uses a uniform probability distribution and KNN to produce artificial data. Its operating principle is as follows. The procedure begins by dividing the data into majority and minority classes. The KNN technique then finds the k nearest neighbors of each minority class sample. For each minority sample used to create synthetic samples, one of its k nearest neighbors is chosen at random. Next, the difference between the minority sample and its chosen nearest neighbor is computed using Eq. (1) [7]:

$$\mathrm{diff} = \left| C_{\mathrm{origin}} - C_{NN_k} \right|, \tag{1}$$

where $C_{NN_k}$ is one of the k nearest neighbors of the minority class sample, selected at random from a uniform probability distribution, and $C_{\mathrm{origin}}$ is the minority class sample. Subsequently, to introduce randomness, the difference is multiplied by a random value drawn from a uniform probability distribution. Finally, the synthetic sample is obtained using Eq. (2) [7]:

$$C_{\mathrm{synth}} = C_{\mathrm{origin}} + \left| C_{\mathrm{origin}} - C_{NN_k} \right| \times P_{\mathrm{uniform}}, \tag{2}$$

where $P_{\mathrm{uniform}}$ denotes a random value drawn from the uniform probability distribution. These steps are repeated until the algorithm meets a stopping criterion, such as the required number of synthetic samples.
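
The interpolation in Eqs. (1) and (2) can be illustrated with a small NumPy sketch. This is a didactic toy rather than the imblearn implementation used later; the feature values are invented, and the function generates a single synthetic sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(c_origin, minority_neighbors):
    """Create one synthetic sample from a minority instance and its k nearest
    minority-class neighbors, following Eqs. (1) and (2)."""
    c_nn = minority_neighbors[rng.integers(len(minority_neighbors))]  # random neighbor
    diff = np.abs(c_origin - c_nn)          # Eq. (1)
    p_uniform = rng.uniform(0.0, 1.0)       # random value from a uniform distribution
    return c_origin + diff * p_uniform      # Eq. (2)

# Toy minority sample with three of its nearest neighbors (invented values).
origin = np.array([55.0, 4.5, 13.2])
neighbors = np.array([[60.0, 4.7, 12.9],
                      [52.0, 4.4, 13.5],
                      [58.0, 4.6, 13.0]])
print(smote_sample(origin, neighbors))
```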

We split both the balanced and imbalanced datasets using the holdout method, with 80% of the data as the training set and the remaining 20% as the test set. The dataset was divided into two random sets, the training set accounting for 80% of the imbalanced dataset (4,219 rows) and of the balanced dataset (6,750 rows), and the test set accounting for 20% of the imbalanced (472 rows) and balanced (1,688 rows) datasets, using sklearn.model_selection.train_test_split in Python. We also encoded the categorical attributes: gender was converted to ‘1’ for male and ‘0’ for female, and the class label was set to 1 for patients who died in hospital within or after 48 hours and 0 for patients who went home, were discharged at their own request, finished observation, were transferred to another hospital, or had no information. To find the most suitable models for the dataset, we experimented with LR, DT, RF, bagging LR, and bagging DT. A flowchart of the model development is shown in Figure 2.
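
As a sketch of this step, and assuming the cleaned data are available as a DataFrame with the Table 1 columns (the file name below is hypothetical), the encoding, SMOTE balancing, and holdout split could look as follows; the order (balance, then split) follows the description above.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart_disease_clean.csv")  # hypothetical export of the preprocessed dataset

# Encode the categorical predictor, assuming raw values "male"/"female".
df["gender"] = df["gender"].map({"male": 1, "female": 0})
X = df.drop(columns=["diagnose"])  # nine predictor attributes from Table 1
y = df["diagnose"]                 # assumed already encoded: 1 = died in hospital, 0 = otherwise

# Balance the minority class with SMOTE, then hold out 20% as the test set.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```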

4.3 Evaluation

The models were then evaluated using several metrics. The confusion matrix, obtained from the sklearn.metrics.confusion_matrix library, summarizes the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and the categorization report from sklearn.metrics.classification_report provides the accuracy score ((TP + TN) divided by (TP + TN + FP + FN)), precision, recall, and F1-score. Because the base data were imbalanced and then balanced with SMOTE, we did not rely on accuracy alone; instead, we used metrics better suited to imbalanced data, namely precision, recall, and F1-score. Precision was calculated by dividing the TPs by the total of TPs and FPs. Recall was calculated by dividing the TPs by the total of TPs and FNs. The F1-score combines precision and recall as their harmonic mean [20, 21].

Eqs. (3)–(5) give the formulas for precision, recall, and F1-score [18]:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{4}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}. \tag{5}$$

As stated above, recall and precision measure the quality of a model; each ranges from 0.0 to 1.0, and the higher the F1-score, the better the model. In this categorization model, an FN is particularly undesirable because it could cause a patient to miss proper treatment and endanger the patient’s life. As the formulas show, precision measures the error caused by FPs, whereas recall measures the error caused by FNs, which makes recall more important in this case. Therefore, the recall score was monitored closely to control unwanted false-negative predictions. Once the recall score is near 1, precision can be addressed, and the F1-score can be used to improve both recall and precision.
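
As a minimal illustration of Eqs. (3)–(5) with scikit-learn (using invented labels and predictions; in the experiments these come from a fitted model and the test set):

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Invented ground truth and predictions, purely to illustrate the metrics.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP), Eq. (3)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN), Eq. (4)
print("F1-score: ", f1_score(y_true, y_pred))          # harmonic mean, Eq. (5)
print(confusion_matrix(y_true, y_pred))                # rows: [TN, FP], [FN, TP]
print(classification_report(y_true, y_pred))           # per-class summary
```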

We also used the ROC AUC curve [22], which is convenient because it uses the predicted probabilities, which the F1-score cannot exploit. The AUC-ROC is a performance metric: the ROC is a probability curve, and the AUC represents the degree of separability. Figure 3 shows an example of an ROC AUC curve.

The AUC-ROC measures how well a model distinguishes between the positive and negative classes. When the AUC is above 0.5, the model can separate the classes, and its ability improves as the AUC approaches 1, meaning the model predicts class 1 as 1 and class 0 as 0. Conversely, when the AUC approaches 0, the model inverts the classes, predicting class 1 as 0 and class 0 as 1, and when the AUC equals 0.5, the model cannot distinguish between the classes at all. A good model therefore has an AUC well above 0.5, and a poor model has an AUC near 0.5 or near 0. The ROC AUC can be increased by defining the weight of each class when tuning the hyperparameters of the model or by balancing the dataset. The yellowbrick.classifier.rocauc.roc_auc library and matplotlib.pyplot were used to calculate and display the curve [18].
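
A brief sketch of how the AUC can be computed: the paper plots the curves with the yellowbrick ROCAUC visualizer, but the same score is available from sklearn.metrics. The labels and probabilities below are invented; in the experiments the scores would come from model.predict_proba on the test set.

```python
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

print(roc_auc_score(y_true, y_score))  # ~0.89; 0.5 = no separability, 1.0 = perfect
# For a fitted classifier: roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```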

5. Results

5.1 Experimental Result

For the experiments on the dataset balanced with the SMOTE algorithm, the class_weight parameter was set to “balanced.” For each class, the weight is defined by Eqs. (6) and (7):

$$w_0 = \frac{N_{\mathrm{train}}}{n_{\mathrm{classes}} \times N_{\mathrm{class}=0}}, \tag{6}$$

$$w_1 = \frac{N_{\mathrm{train}}}{n_{\mathrm{classes}} \times N_{\mathrm{class}=1}}, \tag{7}$$

where $N_{\mathrm{train}}$ is the total number of rows in the training set, $n_{\mathrm{classes}}$ is the number of class labels, and $N_{\mathrm{class}=i}$ is the number of training rows with class label $i$.

In addition, class_weight = “balanced” was set as a hyperparameter in each model that was fitted with the SMOTE-balanced dataset [23].
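
The “balanced” heuristic in Eqs. (6) and (7) matches scikit-learn’s compute_class_weight; a small sketch with an assumed label array is shown below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Assumed training labels; in the experiments y_train comes from the holdout split.
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
print(weights)  # [0.667, 2.0] here: 8 / (2 * 6) and 8 / (2 * 2), as in Eqs. (6) and (7)

# Equivalently, pass the heuristic directly as an estimator hyperparameter.
clf = LogisticRegression(solver="lbfgs", max_iter=100, class_weight="balanced")
```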

The experiments are described below.

Experiment 1: This experiment used the LR algorithm from the sklearn.linear_model.LogisticRegression library. We built the model as logmodel = LogisticRegression(solver='lbfgs', max_iter=100), that is, with lbfgs as the solver and a maximum of 100 iterations. The model was fitted with the raw dataset, which was split using the sklearn.model_selection.train_test_split library as X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2), dividing the dataset into a training set and a test set. We then fitted the model with the training set and evaluated its accuracy and performance using the test set.
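
A minimal sketch of Experiment 1, assuming the feature matrix X and label vector y from the preprocessing step and the hyperparameters quoted above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the raw (imbalanced) data and fit the baseline LR model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
logmodel = LogisticRegression(solver="lbfgs", max_iter=100)
logmodel.fit(X_train, y_train)
print(logmodel.score(X_test, y_test))  # accuracy; the Section 4.3 metrics are computed separately
```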

Experiment 2: The second experiment created the same LR classifier model; however, we balanced the data using the imblearn.over_sampling.SMOTE library. The LR model used the same hyperparameters as in Experiment 1, and model evaluation was performed with the test set.

Experiment 3: In the third experiment, a bagging classifier was created from the sklearn.ensemble.BaggingClassifier library with the same LR model as in the first and second experiments, fitted with the data balanced using the oversampling technique, and evaluated with the test set.
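
A sketch of Experiments 2 and 3, assuming the SMOTE-balanced training split (X_train, y_train) from Section 4.2. Note that recent scikit-learn releases call the base-model argument estimator (older releases used base_estimator), so the name may need adjusting for your version.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Experiment 2: the same LR model, fitted on the SMOTE-balanced training set.
lr_smote = LogisticRegression(solver="lbfgs", max_iter=100)
lr_smote.fit(X_train, y_train)

# Experiment 3: bagging ensemble with LR as the base model (n_estimators left at
# the scikit-learn default of 10, since the paper does not state it).
bag_lr = BaggingClassifier(estimator=LogisticRegression(solver="lbfgs", max_iter=100))
bag_lr.fit(X_train, y_train)
```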

Experiment 4: The fourth experiment created a DT classifier from the sklearn.tree.DecisionTreeClassifier library as clf = DecisionTreeClassifier(criterion='entropy', max_depth=7), meaning the model chooses the feature to split each node using entropy and the maximum tree depth is seven. After the model was created, it was fitted with the training set and tested with the test set on the metrics explained in Section 4.3.
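
A sketch of the DT configuration quoted above, assuming the same training and test split:

```python
from sklearn.tree import DecisionTreeClassifier

# Entropy-based splits with a maximum tree depth of seven, as in Experiment 4.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=7)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```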

Experiment 5: The fifth experiment created an ensemble bagging classifier with a DT base estimator (entropy criterion, maximum depth of seven), fitted with the imbalanced dataset. After the model was created, it was fitted with the training set and examined using the test set.

Experiment 6: The sixth experiment created a DT classifier with the same hyperparameters but fitted with a dataset balanced using the oversampling method (imblearn.over_sampling.SMOTE), and then tested with the same evaluation metrics using the test set.

Experiment 7: In the seventh experiment, a bagging classifier was created with the DT classifier (same hyperparameters) as its base model. The bagging classifier thus consisted of several DTs, each fitted with a random subset of the balanced dataset, and the model was then evaluated with the test set.

Experiment 8: The eighth experiment created an RF model for categorization using the sklearn.ensemble.RandomForestClassifier library as rfc = RandomForestClassifier(max_depth=7, criterion='entropy', n_estimators=7). This model uses entropy to choose splits, with a maximum depth of seven and seven trees. The dataset was split into a training set, used to fit the model, and a test set, used to evaluate it.
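
A sketch of the RF configuration quoted above, assuming the same training and test split:

```python
from sklearn.ensemble import RandomForestClassifier

# Seven entropy-based trees, each grown to a maximum depth of seven, as in Experiment 8.
rfc = RandomForestClassifier(max_depth=7, criterion="entropy", n_estimators=7)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))
```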

Experiment 9: In the ninth experiment, an RF model was created for categorization, but with a dataset balanced using the SMOTE oversampling method from the imblearn.over_sampling.SMOTE library, and the model was evaluated with the test set.

Experiment 10: In the tenth experiment, a bagging classifier was created from the RF classifier with the same hyperparameters and balanced dataset, fitted with the training subset, and tested with the test subset.

5.2 Comparison of Model Performance

From the experiments that were conducted, we evaluated and compared the performance of each model. Tables 2 and 3 show the performance of each model with and without SMOTE.

In most measurements, the SMOTE algorithm enhanced model performance, with the exceptions of the precision of bagging LR and bagging DT and the AUC of bagging DT. The precision, recall, F1-score, and AUC of LR, DT, and RF with the SMOTE algorithm were better than those without it. The recall, F1-score, and AUC of bagging LR improved markedly when the SMOTE algorithm was applied. For bagging DT, the SMOTE algorithm enhanced the recall and F1-score, whereas its precision and AUC were higher without SMOTE; bagging DT nevertheless achieved the best recall and F1-score among the models with the SMOTE algorithm.

Consistent with [24–26], SMOTE enhances the recall value. All five of our classifier models also attained higher recall when the SMOTE algorithm was applied. With the improvement in recall, the number of false negatives decreased.

Figure 4 shows the ROC curves of the five classifier models without the SMOTE algorithm, and Figure 5 shows the ROC curves obtained with the SMOTE algorithm. Bagging DT outperformed the other models both with and without the SMOTE algorithm.

Table 4 lists the performance of each model on the minority class before and after applying the SMOTE algorithm. SMOTE substantially enhanced the precision and F1-score of all the models.

This is the first study to present a systematic comparison of oversampling algorithms for heart disease categorization using data from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. In comparison with [16], our results are grounded in data taken directly from a hospital rather than a public benchmark dataset.

Previous research supports our findings: SMOTE-based learning from imbalanced data has proven successful in a wide range of applications across multiple domains, and SMOTE has inspired several approaches to class imbalance and made significant contributions to supervised learning paradigms [6]. In the medical field, classifying imbalanced data with the SMOTE algorithm achieved superior performance in the automatic computerized detection of pulmonary conditions on several chest X-ray image databases [4]. In addition, research on the expansion and classification of imbalanced data based on the SMOTE algorithm for the Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-Wisconsin datasets showed improved classification performance [5]. Compared to [4] and [5], our research extends the process by comparing experiments with and without SMOTE for each algorithm: LR, DT, RF, bagging LR, and bagging DT. As a result, we were able to investigate the effect of SMOTE on each algorithm.

Finally, this study contributes to the ability of machine-learning-based models to categorize imbalanced data using SMOTE, thereby improving model prediction accuracy, particularly for minority classes. This result contributes to the medical field where imbalanced data are frequently encountered.

6. Conclusion

This study built categorization models that use SMOTE to predict heart disease from Harapan Kita Hospital’s electronic health records. The models, LR, DT, RF, bagging LR, and bagging DT, showed improved prediction accuracy, particularly for the minority class. In future research, we intend to explore further ensemble methods and deep learning to optimize the prediction accuracy.

Fig. 1.

Data preprocessing.


Fig. 2.

Model development.


Fig. 3.

ROC AUC curve example [18].


Fig. 4.

ROC curves of five classifier models without SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.


Fig. 5.

ROC curves of five classifier models with SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.


Table 1. Attribute description.

Attribute name | Description
gender | male or female
age | age in years
eritrosit | the number of red blood cells in the patient’s blood vessels
hematokrit | percentage of red blood cells in blood
hemoglobin | the amount of oxygen- and iron-carrying protein in the patient’s body
hermch | the average mass of hemoglobin per red blood cell in a blood sample from a patient
khermchc | the patient’s average hemoglobin concentration per red blood cell
leukosit | white blood cell count in the patient’s blood vessels
trombosit | the number of platelets in the patient’s blood for blood clotting
diagnose | class label (class target)

Table 2. Performance of each model without SMOTE algorithm.

Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
LR | 47 | 50 | 49 | 64
DT | 70 | 60 | 63 | 62
RF | 71 | 58 | 61 | 70
Bagging LR | 72 | 51 | 50 | 69
Bagging DT | 98 | 66 | 73 | 97

Table 3. Performance of each model with SMOTE algorithm.

Model | Precision (%) | Recall (%) | F1-score (%) | AUC (%)
LR | 66 | 66 | 66 | 72
DT | 76 | 75 | 75 | 82
RF | 78 | 78 | 78 | 87
Bagging LR | 66 | 66 | 66 | 73
Bagging DT | 83 | 83 | 83 | 92

Table 4. Performance of each model before and after using SMOTE algorithm in predicting minority class.

Model | Precision (%) before | Precision (%) after | Recall (%) before | Recall (%) after | F1-score (%) before | F1-score (%) after
LR | 9 | 66 | 67 | 67 | 15 | 66
DT | 10 | 70 | 81 | 84 | 18 | 76
RF | 14 | 75 | 82 | 81 | 24 | 78
Bagging LR | 9 | 66 | 67 | 67 | 15 | 66
Bagging DT | 19 | 81 | 76 | 76 | 30 | 78

References

1. World Health Organization (2020). The top 10 causes of death. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
2. Ministry of Health, Republic of Indonesia (2021). World Heart Day 2021 Commemoration: Maintain Your Heart for a Healthier Life. [Online]. Available: https://promkes.kemkes.go.id/peringatan-hari-jantung-sedunia-2021-jaga-jantungmu-untuk-hidup-lebih-sehat
3. Blagus, R, and Lusa, L (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 14, article no. 106.
4. Wang, J, Xu, M, Wang, H, and Zhang, J (2006). Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. Proceedings of the 2006 8th International Conference on Signal Processing, Guilin, China. https://doi.org/10.1109/ICOSP.2006.345752
5. Wang, S, Dai, Y, Shen, J, and Xuan, J (2021). Research on expansion and classification of imbalanced data based on SMOTE algorithm. Scientific Reports. 11, article no. 24039.
6. Fernandez, A, Garcia, S, Herrera, F, and Chawla, NV (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research. 61, 863-905. https://doi.org/10.1613/jair.1.11192
7. Meidianingsih, Q, Erfiani, and Sartono, B (2017). The study of safe-level SMOTE method in unbalanced data classification. International Journal of Scientific & Engineering Research. 8, 1167-1171.
8. Pradipta, GA, Wardoyo, R, Musdholifah, A, Sanjaya, INH, and Ismail, M (2021). SMOTE for handling imbalanced data problem: a review. Proceedings of the 2021 6th International Conference on Informatics and Computing (ICIC), Jakarta, Indonesia, pp. 1-8. https://doi.org/10.1109/ICIC54025.2021.9632912
9. Muhammad, Y, Tahir, M, Hayat, M, and Chong, KT (2020). Early and accurate detection and diagnosis of heart disease using intelligent computational model. Scientific Reports. 10, article no. 19747.
10. Fan, J, Chen, M, Luo, J, Yang, S, Shi, J, and Yao, Q (2021). The prediction of asymptomatic carotid atherosclerosis with electronic health records: a comparative study of six machine learning models. BMC Medical Informatics and Decision Making. 21, article no. 115.
11. Ward, A, Sarraju, A, Chung, S, Li, J, Harrington, R, Heidenreich, P, Palaniappan, L, Scheinker, D, and Rodriguez, F (2020). Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digital Medicine. 3, article no. 125.
12. Chen, Z, Yang, M, Wen, Y, Jiang, S, Liu, W, and Huang, H (2022). Prediction of atherosclerosis using machine learning based on operations research. Mathematical Biosciences and Engineering. 19, 4892-4910. https://doi.org/10.3934/mbe.2022229
13. Park, S, Hong, M, Lee, H, Cho, NJ, Lee, EY, Lee, WY, Rhee, EJ, and Gil, HW (2021). New model for predicting the presence of coronary artery calcification. Journal of Clinical Medicine. 10, article no. 457.
14. Tu, MC, Shin, D, and Shin, D (2009). Effective diagnosis of heart disease through bagging approach. Proceedings of the 2009 2nd International Conference on Biomedical Engineering and Informatics, Tianjin, China, pp. 1-4. https://doi.org/10.1109/BMEI.2009.5301650
15. Lee, H, Kim, J, and Kim, S (2017). Gaussian-based SMOTE algorithm for solving skewed class distributions. International Journal of Fuzzy Logic and Intelligent Systems. 17, 229-234. https://doi.org/10.5391/IJFIS.2017.17.4.229
16. Waqar, M, Dawood, H, Dawood, H, Majeed, N, Banjar, A, and Alharbey, R (2021). An efficient SMOTE-based deep learning model for heart attack prediction. Scientific Programming. 2021, article no. 6621622.
17. Glazkova, A (2020). A comparison of synthetic oversampling methods for multi-class text classification. [Online]. Available: https://arxiv.org/abs/2008.04636
18. Cho, E, Chang, TW, and Hwang, G (2022). Data preprocessing combination to improve the performance of quality classification in the manufacturing process. Electronics. 11, article no. 477.
19. Limberg, C, Wersing, H, and Ritter, H (2020). Beyond cross-validation–accuracy estimation for incremental and active learning models. Machine Learning and Knowledge Extraction. 2, 327-346. https://doi.org/10.3390/make2030018
20. Zeya, LT (2021). Precision and recall made simple. [Online]. Available: https://towardsdatascience.com/precisionand-recall-made-simple-afb5e098970f
21. Zeya, LT (2021). Essential things you need to know about F1-score. [Online]. Available: https://towardsdatascience.com/essential-things-youneed-to-know-about-f1-score-dbd973bf1a3
22. Scikit-yellowbrick (2019). ROCAUC. [Online]. Available: https://www.scikit-yb.org/en/latest/api/classifier/rocauc.html
23. Singh, K (2020). How to improve class imbalance using class weights in machine learning. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
24. Turlapati, VPK, and Prusty, MR (2020). Outlier-SMOTE: a refined oversampling technique for improved detection of COVID-19. Intelligence-Based Medicine. 3, article no. 100023.
25. Moda, L (2020). COVID-19: optimizing recall with SMOTE. [Online]. Available: https://www.kaggle.com/code/lukmoda/covid-19-optimizing-recall-with-smote/notebook
26. Korstanje, J (2021). SMOTE. [Online]. Available: https://towardsdatascience.com/smote-fdce2f605729

Mediana Aryuni is a Faculty Member in the School of Information Systems, Bina Nusantara University, Jakarta, Indonesia. Her research interests are data mining and text mining. E-mail: mediana.aryuni@binus.ac.id

Suko Adiarto is a cardiologist subspecializing in interventional cardiology and a member of the medical staff at the Department of Cardiology and Vascular Medicine, University of Indonesia-National Heart Center Harapan Kita. His special research interest is in the vascular field. E-mail: sukoadiarto@gmail.com

Eka Miranda is a Faculty Member in the School of Information Systems, Bina Nusantara University, Jakarta, Indonesia. Her research interests are data mining and text mining. E-mail: ekamiranda@binus.ac.id

Evaristus Didik Madyatmadja is a Faculty Member in the School of Information Systems, Bina Nusantara University, Jakarta, Indonesia. His research interests are data mining, E-Government and smart city. E-mail: emadyatmadja@binus.edu

Albert Verasius Dian Sano is a Faculty Member in the School of Computer Science, Bina Nusantara University, Jakarta, Indonesia. His research interests are data mining, E-Government and smart city. E-mail: albert.sano@binus.edu

Elvin Sestomi is a student in the School of Information Systems, Bina Nusantara University, Jakarta, Indonesia. Her research interests are business intelligence and advanced databases. E-mail: elvin.sestomi@binus.ac.id

Article

Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2023; 23(2): 140-151

Published online June 25, 2023 https://doi.org/10.5391/IJFIS.2023.23.2.140

Copyright © The Korean Institute of Intelligent Systems.

Imbalanced Learning in Heart Disease Categorization: Improving Minority Class Prediction Accuracy Using the SMOTE Algorithm

Mediana Aryuni1 , Suko Adiarto2 , Eka Miranda1 , Evaristus Didik Madyatmadja1 , Albert Verasius Dian Sano3 , and Elvin Sestomi1

1Department of Information Systems, Bina Nusantara University, Jakarta, Indonesia
2Department of Cardiology and Vascular Medicine, University of Indonesia, Jakarta, Indonesia
3Department of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Correspondence to:Mediana Aryuni (mediana.aryuni@binus.ac.id)

Received: December 5, 2022; Revised: April 6, 2023; Accepted: May 22, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In the field of medical data mining, imbalanced data categorization occurs frequently, which typically leads to classifiers with low predictive accuracy for the minority class. This study aims to construct a classifier model for imbalanced data using the SMOTE oversampling algorithm and a heart disease dataset obtained from Harapan Kita Hospital. The categorization model utilized logistic regression, decision tree, random forest, bagging logistic regression, and bagging decision tree. SMOTE improved the model prediction accuracy with imbalanced data, particularly for minority classes.

Keywords: Heart disease, Prediction, Imbalanced, Accuracy, SMOTE

1. Introduction

According to the World Health Organization (WHO), 16% of all deaths worldwide are caused by heart disease. This disease has shown the highest increase in deaths since 2000, with a growth of more than 2 million deaths to 8.9 million deaths in 2019 [1]. Furthermore, according to the Institute for Health Metrics and Evaluation, the number of deaths per year caused by coronary heart disease in Indonesia is 245,343 [2].

As a result, increased prevention efforts are required, such as early detection of heart disease, which can be achieved through a doctor’s diagnosis. Alternative tools for predicting heart disease may provide solutions to these problems. For example, a predictive model was developed using machine learning-based predictive models. Therefore, it is crucial that heart disease prediction models are continuously improved for the purpose of developing assistive devices.

Categorization is the process of categorizing a specific set of data into classes. The process begins by predicting the class based on the provided data points. Classes are also referred to as targets, labels, or categories.

Categorization aims to create a rule that can be used to determine the category of new data based on a dataset with a known category (training set). Several categorization algorithms were established from the attribute values assessed for each dataset [3].

In the fields of medical pattern recognition and data mining, imbalanced data categorization frequently occurs [4] and is caused by various conditions in various sample collections [5].

Imbalanced data occurs when there are significantly more observations in one class than in the training data [6, 7]. When the distribution of classes is imbalanced, the categorization outcome is skewed, leading to mis-categorization [7]. The prediction accuracy of the minority data is affected by this condition [8].

Class imbalance has significant effects on the learning process, and typically leads to classifiers that have low predictive accuracy for the minority class and tend to classify the majority of new samples into the majority class. Therefore, it is crucial to evaluate the performance of classifiers [3]. Many common learning algorithms perform poorly in categorization tasks because of the class-imbalance problem [5, 6].

Numerous categorization techniques, including the feature selection algorithm, cost-sensitive learning algorithm, undersampling technique over the majority class, and oversampling strategy over the minority class, have been extensively researched for this purpose [4].

Extensive research has been conducted on categorization models for heart disease [914]. Their machine learning-based prediction model for heart disease did not include methods to address imbalanced data in heart disease categorization.

Previous studies [38, 15] have used the Synthetic Minority Oversampling Technique (SMOTE) algorithm to solve imbalanced data problems in categorization. The study [16] demonstrated an effective SMOTE for heart attack prediction. They used the UCI dataset to train machine-learning algorithms to predict heart attacks. In this study [17], the SMOTE algorithm performs slightly better than the random oversampling technique in most cases.

The objective of this study is to develop a classifier model for heart disease categorization. Five models were created for this purpose: logistic regression (LR), decision tree (DT), random forest (RF), bagging logistic regression, and bagging decision tree. The dataset was obtained from Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. To handle disproportionate data, which is common in the healthcare sector, we used the SMOTE algorithm to improve the minority class prediction accuracy.

This paper is organized as follows. The research background and objective are discussed in Section 1. Related research is listed in Section 2. In Sections 3 and 4 discusses the proposed methods for categorizing heart disease. Section 5 explains the experimental results, analysis, and evaluation. Section 6 concludes the report by outlining the future research objectives.

2. Related Works

We reviewed relevant literature for this study. Numerous machine-learning techniques have been employed to forecast heart disease, including LR [911], DT [9, 10], RF [913], and bagging [14]. Muhammad et al. [9] developed heart disease prediction methods using LR, support vector machine (SVM), k-nearest neighbors (KNN), DT, RF, and gradient boosting (GB). With the use of 6,553 patients’ electronic medical data, the study [10] built six machine-learning models (LR, RF, DT, XGBoost, Gaussian Naive Bayes, and KNN) for the prediction of asymptomatic carotid atherosclerosis in China.

The LR, RF, GB, and XGBoost algorithms were used in [11] to predict atherosclerotic cardiovascular disease (ASCVD) in Asian or Hispanic patients. Using data from the retrospective study and statistical analysis for atherosclerosis prediction, the study [12] applied a random forest for atherosclerosis in China. Based on data from a retrospective analysis of 3,302 Korean patients, one study [13] developed multiple machine-learning models (CART decision tree and RF) to predict the existence of coronary artery calcification.

The study [14] suggested using a bagging algorithm to identify heart disease warning symptoms in patients and compared the bagging algorithm’s performance with the DT method.

3. Dataset

The dataset was collected from the Harapan Kita Hospital’s electronic health records in Jakarta, Indonesia. It comprises 108,191 in-patient medical records of heart disease (International Classification of Diseases [ICD] code I25.1), with two categories of diagnosis of how the patient left the hospital: (1) allowed to go home or referred to another hospital or no description and (2) died in less than 48 hours or died more than 48 hours.

4. Methodology

4.1 Preprocessing

Data preparation, often known as data preprocessing, involves preparing data for further processing. As we are aware, data preparation includes preprocessing. This stage produces reliable data analytics for the operational data. These phases include quality evaluation, feature selection, data cleaning, and data transformation [18]. After data preprocessing, the final stage, before using the data to train the model, balanced the minority class data using SMOTE.

Figure 1 illustrates the data preprocessing steps. From the electronic health records, we accessed the heart disease medical record and laboratory test tables. There were 65 attributes of medical records in the table: (1) patient information, covers: registration_date, return_date, patient_name, medical record code, age, room_code and laboratory test code; doctor name; (2) icd code and icd code description; (3) icd_sk (International Subclassifications of Diseases); (4) 9 cm_sk (International Classification of Medical Procedures), but icd_sk and 9 cm_sk were removed due to the column which is mostly empty; (5) the patient’s condition when they leave the hospital and how they leave: is it at their own request, are they permitted to go home, are they referred to another hospital, or no description; and they died in less than 48 hours or more than 48 hours. Instances with empty attributes were also removed.

The next process involved combining the medical records and laboratory test tables. The laboratory test table has 58 attributes that store blood test results, but not all tests are performed; therefore, several attributes are left empty. The laboratory test only requires testing for khermchc, hermch, trombosit, leukosit, hemoglobin, eritrosit, and hematokrit. We obtained a dataset with 4,691 rows.

The next step was the feature selection process, which is a process of deciding which attributes are most suitable for the analysis to reduce bias when using the data by removing attributes that are not relevant. After screening and discussions with a medical expert, we used nine attributes and one class label (class target) based on a relevant test. Table 1 lists the descriptions of the attributes.

From the data preprocessing and data analysis, we discovered eight numerical attributes, including age, eritrosit, hematokrit, hemoglobin, hermch, khermchc, leukosit, and trombosit, and two categorical attributes, sex and diagnosis (class target). For the class target(diagnosis), there were two categories of how heart disease patients left the hospital: (1) allowed by the hospital to leave or referred to another hospital and (2) the patient died less or more than 48 hours. The doctor determines how the patient should be discharged from the hospital and uses these states as class labels. As a result, our dataset included ground-truth labels based on the opinions of medical experts.

4.2 Model Development

This model provides a mathematical representation to recognize a certain type of pattern. We trained a model with a dataset and provided it with an algorithm that can be used to reason and learn from the data. In this study, we used a categorization model. The model simply labels samples as an excellent score repetitively 100% accurate, but it cannot predict well on data that have not been observed; this is called overfitting.

To prevent this type of problem, there is a common best practice of dividing the dataset into two sets: training and testing sets for training any machine-learning model, without considering the nature of the dataset used. However, before that, we balanced the data with the help of the imblearn.over_sampling. SMOTE library to balance the dataset for each targeted class. After the data were balanced, the dataset was split into two or four datasets: X-train, Y-train, X-test, and Y-test. The training set is used to estimate different parameters or to compare different model performances, and the test set is used to examine the model [19].

The SMOTE approach uses a uniform probability distribution and KNN to produce artificial data. The operating principles are as follows. The procedure begins by dividing the data provided into majority and minority classes. The KNN technique then determines the k number of neighbors of each minority class data. Each minority sample used to create the synthetic samples had a unique nearest neighbor among the KNN chosen at random. Next, we determine Eq. (1) for the remaining members of the minority sample and its chosen nearest neighbor [7]:

dif=|Corigin-CNNk|,

where via random selection from a uniform probability distribution, CNNk is one of the kth neighbors of the minority class data and Corigin is the minority class data. Subsequently, to introduce randomness, the difference is multiplied by a random value drawn from a uniform probability distribution. Finally, using Eq. (2), a synthetic sample was obtained [7].

Csynth=Corigin+C|Corigin-CNNk|×Puniform,

where a random value from the uniform probability distribution is denoted as Puniform. These steps are repeated until the algorithm meets the stopping criterion, which may include the number of samples produced.

We split both the balanced and imbalanced datasets using the holdout method, which typically uses 80% of the data as the training set and the remaining 20% as the test set. The dataset was divided into two random sets with the training set accounting for 80% of both the imbalanced dataset (4,219 rows) and balanced dataset (6,750 rows), and the test set accounting for 20% of both the imbalanced (472 rows) and balanced datasets (1,688 rows) using sklearn.model_selection.train_test_split in Python. We also converted the categorical attribute value namely gender to ‘1’ for male and ‘0’ for female, for a patient who died in hospital within 48 hours or above 48 hours with 1. The patient with the state went home or was discharged at their own request or finished observation or transferred to another hospital or had no information with 0. To find the most suitable models for use with the dataset, we experimented with LR, DT, RF, and bagging LR and DT. A flowchart of the model development is shown in Figure 2.

4.3 Evaluation

The models were then evaluated using a variety of relative matrices, such as the confusion matrix, which measures the models based on the true positive value (TP), true negative value (TN), false positive (FP), and false negative (FN) by importing the sklearn.metrics.confusion_matrix library, and a categorization model report that consists of the accuracy score (TP and TN divided by FP and FN), precision, F1-score, and recall by importing sklearn.metrics.classification_report. Because the base data were imbalanced and then balanced with SMOTE, we decided not to use the confusion matrix; instead, we used another matrix that is good at examining models with imbalanced data, such as precision, recall, and F1-score. Precision scores were calculated by dividing the TPs by the total TP and TN. The recall was calculated by dividing the total TPs by the total TPs and FNs. The F1-score is a combination of precision and recall and is a type of harmonic mean precision and recall score [20, 21].

Eqs. (3)(5) contain the equations for precision, recall, and F1-score [18].

Precision=(TP)/(TP+FP),Recall=TP/(TP+FN),F1-socre=(2×Recall×Precison)/(Recall+Precsion).

As stated above, the confusion matrix is used to check the quality of recall and precision of a model, and the highest value of this matrix is 1.0, and the lowest is 0.0, the higher the F1-score, the better the model. In this categorization model, an undesirable FN value could cause the patient to fail to receive proper treatment and endanger the patient’s life. As we can conclude from the formula, the precision score measures the extent of error caused by FPs, whereas recall measures the extent of error caused by FNs, which means that in this case, it is more important. Therefore, the recall measurement score was monitored over to handle the unwanted false negative predictions. If the recall score is already near 1, then we can continue the precision and use the F1-score to improve both the recall and precision scores.

We also used the ROC AUC curve [22], which is believed to be more convenient because it employs the probability of prediction where F1 is not capable to do so. The AUC-ROC is a matrix used to measure performance, ROC is a probability curve, and AUC represents the degree of separability. Figure 3 shows an example of an ROC AUC curve.

The AUC-ROC measures how capable a model can distinguish between TPs and TNs, when it is above 5, it is believed to be able to distinguish as it gets higher until the ROC curve reaches 1, which means that the model is capable of predicting class 1 as 1 and 0 as 0, and vice versa when the ROC reaches 0, which means the model predicts negative classes, which means that class 1 was predicted as 0 and 0 as 1, and when the ROC was at 0.5, the model could not distinguish between the positive true and false values. A good model had an ROC above the AUC score, and a poor model had an ROC is near 0.5 or near 0. The ROC AUC can be increased by defining the weight of each class when tuning the hyperparameters of the model or balancing the dataset. yellowbrick.classifier.rocauc.roc_auc library and matplotlib.pyplot was added to calculate and display the curve [18].

5. Results

5.1 Experimental Result

For the balancing dataset experiment using the SMOTE algorithm, the class_weight parameter was set to “balanced.” For each class, the weight is defined by Eqs. (6) and (7).

Weight for class 0=Total rows in the training settotal class Label*total row with the class label 0,Weight for class 1=Total rows in the training settotal class Label*total row with the class label 1.

The dataset that was balanced with SMOTE library which the class_weight = “balanced” was set in the hyper parameter in each model that was fitted with the SMOTE balanced dataset [23].

The experiment is described below.

Experiment 1: this experiment was conducted using a LR algorithm from the skleanr.linear_model.Logistic Regression library. We built the model as logmodel = Logistic Regression( solver=‘lbfgs,’ max_iter= 100). This means that lbfgs as the model solver with maximum iterations for the 1,000 experiments was fitted with the raw dataset that was split using sklearn.model_selection.train_test_split library as X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2), which divided our dataset into two sets: the training set and the test set. We then fitted the model with the training set and evaluated the model’s accuracy and performance using the test set.

Experiment 2: The second experiment created the same LR classifier model; however, we balanced the data using the imblearn.oversampling.SMOTE library. The LR model still used the same hyperparameters as in Experiment 1, and model evaluation was performed with the test set.

Experiment 3: In the third experiment, a bagging classifier was created from the sklearn.ensemble.BaggingClassifier library with the same LR model as in the first and second experiments and fitted it with balanced data using the oversampling technique and model evaluation with the test set.

Experiment 4: The fourth experiment was conducted by creating a DT classifier from the sklearn.tree.DecisionTreeClassifier library with model as clf = DecisionTreeClassifier(criterion =’entropy,’ max_depth= 7), which means the model searched the feature for a split into nodes by using entropy. The maximum depth of the tree split was seven after the model was created, and the model was fitted with the training set and tested with the test set on various matrices, as explained in Section 4.3.

Experiment 5: The fifth experiment was conducted by creating an ensemble bagging classifier with a base DT estimator fitted with an imbalanced dataset with entropy as the criterion and a maximum depth of seven leaf nodes. After the model was created, it was fitted with x_train and examined using the x_test dataset.

Experiment 6: The sixth experiment was conducted by creating a DT classifier with the same hyperparameter but fitted with a balanced dataset using the oversampling method (imblearn. over_sampling.SMOTE) and then tested with the same model evaluation matrix using the test set.

Experiment 7: In the seventh experiment, a bagging classifier was created with the base model of the DT classifier with the same hyperparameters. This means that the bagging classifier consisted of a few DTs that were fitted with a random subset of the balanced dataset, and the model was then evaluated with the test set.

Experiment 8: in the eighth experiment created an RF model for the categorization by using sklearn.ensemble.RandomForest Classifier library as rfc = RandomForestClassifier(max_depth= 7, criterion=‘entropy,’ n_estimators= 7). This method uses entropy to split the feature with a maximum depth of seven splits. The dataset was split into two sets as the training set and the test set was used to train the model by fitting it to the model; then, we evaluated the model with the test set.

Experiment 9: In the ninth experiment, an RF model was created for categorization, but with a balanced dataset using the SMOTE oversampling method from the imblearn.over_sampling. SMOTE library, and the model was evaluated with the test set.

Experiment 10: In the tenth experiment, a bagging classifier was created from the RF classifier with the same hyperparameters and balanced dataset, fitted with the training subset, and tested with the test subset.

5.2 Comparation of Model Performance

From the experiments which was conducted, we evaluated and compared the performance of each model. Tables 2 and 3 show the performance of each model with and without SMOTE.

In most measurements of each model, the SMOTE algorithm enhanced the performance of the model, except for the precision of bagging LR and DT and the AUC of bagging DT. The precision, recall, F1-score, and AUC of LR, DT, and RF with the SMOTE algorithm were slightly better than those without the algorithm. The recall, F1-score, and AUC of bagging LR showed significant improvement when applying the SMOTE algorithm. In the bagging DT, SMOTE algorithm enhanced the recall and F1-score. Bagging DT outperformed in precision and AUC without the SMOTE algorithm and surpassed the recall and F1-score with the SMOTE algorithm.

Based on [2426], SMOTE enhances the recall value. Our 5-classifier model also attained a recall enhancement by applying the SMOTE algorithm. With an improvement in recall, the false-negative values decreased.

Figure 4 shows the ROC curves of the five classifier models without the SMOTE algorithm. The ROC curves of the five classifier models obtained using the SMOTE algorithm are shown in Figure 5. The bagging DT surpassed both with and without the SMOTE algorithm.

Table 4 lists the performance of each model before and after the SMOTE algorithm was used to predict the minority class. SMOTE significantly enhanced the precision and F1-score of all the models.

This is the first study to present a systematic comparison of oversampling algorithms for heart disease categorization using data from Harapan Kita Hospital’s electronic heath records in Jakarta, Indonesia. In comparison with [16], we obtained factual results because we used data directly from hospitals.

Previous research supports our findings, indicating that SMOTE-based learning from imbalanced data has proven successful across a wide range of applications and domains. SMOTE has inspired several approaches to addressing class imbalance and has made significant contributions to supervised learning paradigms [6]. In the medical field, classifying imbalanced data with the SMOTE algorithm achieved superior performance on several chest X-ray image databases for automatic computerized pulmonary detection [4]. In addition, research on the expansion and classification of imbalanced data based on the SMOTE algorithm reported better classification results on the Pima, WDBC, WPBC, Ionosphere, and Breast-cancer-Wisconsin datasets [5]. Compared with [4] and [5], our study extends the analysis by comparing experiments with and without SMOTE for each algorithm (LR, DT, RF, bagging LR, and bagging DT), which allowed us to investigate the effect of SMOTE on each algorithm.

Finally, this study demonstrates how machine-learning-based models can categorize imbalanced data using SMOTE, thereby improving prediction accuracy, particularly for the minority class. This result is relevant to the medical field, where imbalanced data are frequently encountered.

6. Conclusion

In this study, categorization models with SMOTE were used to predict heart disease from Harapan Kita Hospital's electronic health records. These models, namely LR, DT, RF, bagging LR, and bagging DT, showed improved prediction accuracy with SMOTE, particularly for the minority class. In future research, we intend to explore additional ensemble methods and deep learning to further optimize prediction accuracy.

Figure 1. Data preprocessing.

Figure 2. Model development.

Figure 3. ROC AUC curve example [18].

Figure 4. ROC curves of five classifier models without SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.

Figure 5. ROC curves of five classifier models with SMOTE algorithm: (a) LR, (b) DT, (c) RF, (d) bagging LR, (e) bagging DT.

Table 1. Attribute description.

Attribute name   Description
gender           male or female
age              age in years
eritrosit        the number of red blood cells in the patient's blood vessels
hematokrit       percentage of red blood cells in blood
hemoglobin       the amount of oxygen- and iron-carrying protein in the patient's body
hermch           the average mass of hemoglobin per red blood cell in a blood sample from a patient
khermchc         the patient's average hemoglobin concentration per red blood cell
leukosit         the number of white blood cells in the patient's blood vessels
trombosit        the number of platelets in the patient's blood for blood clotting
diagnose         class label (class target)

Table 2. Performance of each model without SMOTE algorithm.

Model         Precision (%)   Recall (%)   F1-score (%)   AUC (%)
LR            47              50           49             64
DT            70              60           63             62
RF            71              58           61             70
Bagging LR    72              51           50             69
Bagging DT    98              66           73             97

Table 3. Performance of each model with SMOTE algorithm.

Model         Precision (%)   Recall (%)   F1-score (%)   AUC (%)
LR            66              66           66             72
DT            76              75           75             82
RF            78              78           78             87
Bagging LR    66              66           66             73
Bagging DT    83              83           83             92

Table 4. Performance of each model before and after using SMOTE algorithm in predicting minority class.

Model         Precision (%)       Recall (%)          F1-score (%)
              Before    After     Before    After     Before    After
LR            9         66        67        67        15        66
DT            10        70        81        84        18        76
RF            14        75        82        81        24        78
Bagging LR    9         66        67        67        15        66
Bagging DT    19        81        76        76        30        78

References

1. World Health Organization (2020). The top 10 causes of death. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
2. Ministry of Health, Republic of Indonesia (2021). World Heart Day 2021 Commemoration: Maintain Your Heart for a Healthier Life. [Online]. Available: https://promkes.kemkes.go.id/peringatan-hari-jantung-sedunia-2021-jaga-jantungmu-untuk-hidup-lebih-sehat
3. Blagus, R, and Lusa, L (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 14, article no. 106.
4. Wang, J, Xu, M, Wang, H, and Zhang, J (2006). Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. Proceedings of the 2006 8th International Conference on Signal Processing, Guilin, China. https://doi.org/10.1109/ICOSP.2006.345752
5. Wang, S, Dai, Y, Shen, J, and Xuan, J (2021). Research on expansion and classification of imbalanced data based on SMOTE algorithm. Scientific Reports. 11, article no. 24039.
6. Fernandez, A, Garcia, S, Herrera, F, and Chawla, NV (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research. 61, 863-905. https://doi.org/10.1613/jair.1.11192
7. Meidianingsih, Q, Erfiani, and Sartono, B (2017). The study of safe-level SMOTE method in unbalanced data classification. International Journal of Scientific & Engineering Research. 8, 1167-1171.
8. Pradipta, GA, Wardoyo, R, Musdholifah, A, Sanjaya, INH, and Ismail, M (2021). SMOTE for handling imbalanced data problem: a review. Proceedings of the 2021 6th International Conference on Informatics and Computing (ICIC), Jakarta, Indonesia, pp. 1-8. https://doi.org/10.1109/ICIC54025.2021.9632912
9. Muhammad, Y, Tahir, M, Hayat, M, and Chong, KT (2020). Early and accurate detection and diagnosis of heart disease using intelligent computational model. Scientific Reports. 10, article no. 19747.
10. Fan, J, Chen, M, Luo, J, Yang, S, Shi, J, and Yao, Q (2021). The prediction of asymptomatic carotid atherosclerosis with electronic health records: a comparative study of six machine learning models. BMC Medical Informatics and Decision Making. 21, article no. 115.
11. Ward, A, Sarraju, A, Chung, S, Li, J, Harrington, R, Heidenreich, P, Palaniappan, L, Scheinker, D, and Rodriguez, F (2020). Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population. NPJ Digital Medicine. 3, article no. 125.
12. Chen, Z, Yang, M, Wen, Y, Jiang, S, Liu, W, and Huang, H (2022). Prediction of atherosclerosis using machine learning based on operations research. Mathematical Biosciences and Engineering. 19, 4892-4910. https://doi.org/10.3934/mbe.2022229
13. Park, S, Hong, M, Lee, H, Cho, NJ, Lee, EY, Lee, WY, Rhee, EJ, and Gil, HW (2021). New model for predicting the presence of coronary artery calcification. Journal of Clinical Medicine. 10, article no. 457.
14. Tu, MC, Shin, D, and Shin, D (2009). Effective diagnosis of heart disease through bagging approach. Proceedings of the 2009 2nd International Conference on Biomedical Engineering and Informatics, Tianjin, China, pp. 1-4. https://doi.org/10.1109/BMEI.2009.5301650
15. Lee, H, Kim, J, and Kim, S (2017). Gaussian-based SMOTE algorithm for solving skewed class distributions. International Journal of Fuzzy Logic and Intelligent Systems. 17, 229-234. https://doi.org/10.5391/IJFIS.2017.17.4.229
16. Waqar, M, Dawood, H, Dawood, H, Majeed, N, Banjar, A, and Alharbey, R (2021). An efficient SMOTE-based deep learning model for heart attack prediction. Scientific Programming. 2021, article no. 6621622.
17. Glazkova, A (2020). A comparison of synthetic oversampling methods for multi-class text classification. [Online]. Available: https://arxiv.org/abs/2008.04636
18. Cho, E, Chang, TW, and Hwang, G (2022). Data preprocessing combination to improve the performance of quality classification in the manufacturing process. Electronics. 11, article no. 477.
19. Limberg, C, Wersing, H, and Ritter, H (2020). Beyond cross-validation: accuracy estimation for incremental and active learning models. Machine Learning and Knowledge Extraction. 2, 327-346. https://doi.org/10.3390/make2030018
20. Zeya, LT (2021). Precision and recall made simple. [Online]. Available: https://towardsdatascience.com/precisionand-recall-made-simple-afb5e098970f
21. Zeya, LT (2021). Essential things you need to know about F1-score. [Online]. Available: https://towardsdatascience.com/essential-things-youneed-to-know-about-f1-score-dbd973bf1a3
22. scikit yellow brick (2019). ROCAUC. [Online]. Available: https://www.scikit-yb.org/en/latest/api/classifier/rocauc.html
23. Singh, K (2020). How to improve class imbalance using class weights in machine learning. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/
24. Turlapati, VPK, and Prusty, MR (2020). Outlier-SMOTE: a refined oversampling technique for improved detection of COVID-19. Intelligence-Based Medicine. 3, article no. 100023.
25. Moda, L (2020). COVID-19: optimizing recall with SMOTE. [Online]. Available: https://www.kaggle.com/code/lukmoda/covid-19-optimizing-recall-with-smote/notebook
26. Korstanje, J (2021). SMOTE. [Online]. Available: https://towardsdatascience.com/smote-fdce2f605729
