Int. J. Fuzzy Log. Intell. Syst. 2017; 17(4): 229-234
Published online December 25, 2017
https://doi.org/10.5391/IJFIS.2017.17.4.229
© The Korean Institute of Intelligent Systems
Hansoo Lee, Jonggeun Kim, and Sungshin Kim
Department of Electrical and Computer Engineering, Pusan National University, Busan, Korea
Correspondence to: Sungshin Kim (sskim@pusan.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
A sufficient amount of training data is essential for implementing a classifier with excellent performance. However, the data obtained in practice usually follow a significantly biased class distribution. This is called the class imbalance problem, and it occurs frequently in real-world applications. It causes a considerable performance drop because most machine learning methods assume that the given data follow a balanced class distribution. If the problem is not solved, the resulting classifier will produce false classification results. Therefore, this paper proposes a novel method, named Gaussian-based SMOTE, that solves the problem by incorporating the Gaussian distribution into the synthetic data generation process. Experiments with actual cases confirm that the proposed method can solve the class imbalance problem.
Keywords: Skewed class distribution, SMOTE, Gaussian random variable, Classification
Building a reliable and accurate classifier requires several essential conditions, including a sufficient amount of data. However, data collected in the real world frequently follow a skewed class distribution; in other words, the majority class far outnumbers the minority class. Because most classification methods assume an evenly distributed dataset, the skewed distribution can cause a significant performance loss. This is called the class imbalance problem. If the feature space of the given dataset is high-dimensional, the problem degrades classifier performance even more severely. Without pre-processing, therefore, the classifier has no choice but to produce unreliable and poor classification results [1].
Many strategies have been proposed to solve the issue [2–5]. They can be categorized into three representative approaches: random sampling [6, 7], algorithmic modification [8], and cost-sensitive learning [9, 10]. Among them, the random sampling based methods, namely undersampling and oversampling, are popular choices because they establish a numerical balance between the majority and the minority: undersampling decreases the number of majority class data, while oversampling increases the number of minority class data. They allow classifiers to learn from the given data without bias. However, the classical random sampling method simply selects existing samples at random with replacement, which does little to improve classifier performance.
Therefore, more sophisticated models have been proposed to deal with the problem. Among them, the synthetic minority oversampling technique (SMOTE) [11] has become one of the most renowned solutions to the class imbalance problem. It creates synthetic data based on the feature space similarities between existing minority samples.
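The two random sampling directions described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of the cited works: the function names, the toy data, and the choice to undersample without replacement are assumptions made here for clarity.

```python
import random

def random_oversample(minority, target_size, rng):
    """Grow the minority class by re-drawing existing samples with replacement."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

def random_undersample(majority, target_size, rng):
    """Shrink the majority class to a random subset (drawn without replacement)."""
    return rng.sample(list(majority), target_size)

rng = random.Random(0)
minority = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25)]                  # toy data
majority = [(0.8 + 0.01 * i, 0.9 - 0.01 * i) for i in range(12)]   # toy data

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
```

Note that oversampling in this way only repeats points that already exist, so no new region of the feature space is covered.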
In other words, the SMOTE method generates synthetic samples based on a combination of
In this paper, we focus on the issue that synthetic samples tend to be generated on the line between minority samples. If there is a large gap between the sizes of the majority and the minority classes, an enormous amount of synthetic data needs to be created, which means that many synthetic samples are placed on the same line with high probability. This can be considered a type of over-generalization problem. The proposed method, named Gaussian-based SMOTE, addresses the problem by incorporating a Gaussian probability distribution in the feature space. The Gaussian probability distribution lets the SMOTE algorithm generate new artificial samples that deviate from the line, but not significantly.
The rest of the paper is organized as follows. Section 2 briefly explains the SMOTE algorithm. Section 3 introduces the proposed Gaussian-based SMOTE. Section 4 provides experimental results, and the conclusion is given in Section 5.
The SMOTE method generates synthetic data using
where
where
The SMOTE, Borderline-SMOTE, and safe-level-SMOTE algorithms generate synthetic data by using a random number drawn from a uniform distribution. However, when a specific minority class sample and a particular nearest neighbor are selected frequently during the process, more than one synthetic sample can be created between them. In other words, the synthetic data are placed on the same line with high probability, which could intensify the over-generalization problem.
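The line-bound behavior of uniform interpolation can be illustrated with a short Python sketch of the widely known SMOTE update, in which a synthetic point is the minority sample plus a uniform random fraction of the difference to one of its nearest minority neighbors. The helper names, the parameter `k`, and the toy data are assumptions for illustration and are not taken from this paper.

```python
import math
import random

def nearest_neighbors(sample, minority, k):
    """Return the k minority samples closest to `sample` (Euclidean distance)."""
    others = [m for m in minority if m is not sample]
    others.sort(key=lambda m: math.dist(sample, m))
    return others[:k]

def smote_sample(sample, neighbor, rng):
    """Interpolate between `sample` and `neighbor` with a uniform random weight."""
    delta = rng.random()  # uniform in [0, 1)
    return tuple(s + delta * (n - s) for s, n in zip(sample, neighbor))

def smote(minority, n_synthetic, k, rng):
    """Generate n_synthetic points, each on a segment between two minority samples."""
    synthetic = []
    for _ in range(n_synthetic):
        sample = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(sample, minority, k))
        synthetic.append(smote_sample(sample, neighbor, rng))
    return synthetic

minority = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.3, 0.3)]  # toy data
synthetic = smote(minority, n_synthetic=10, k=2, rng=random.Random(0))
```

Because every generated point is a convex combination of two existing samples, repeated selection of the same pair can only add more points to the same segment.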
Therefore, we propose a novel SMOTE algorithm, named Gaussian-based SMOTE, that ensures more diversity when generating artificial samples. The basic underlying principle of the proposed algorithm is the same as that of the SMOTE method: compute the difference between a minority class sample and its randomly selected nearest neighbor in
After that, the Gaussian-based SMOTE method draws a number between 0 and the difference value to roughly estimate the location of a synthetic candidate, as shown in
As the next step, another number is drawn from a Gaussian (or normal) distribution as
Finally, a synthetic sample is generated as
An overview of Gaussian-based SMOTE is given in Figure 2. Including the Gaussian probability distribution in the process expands the region where a synthetic sample can be generated beyond the line between minority samples. At the same time, the Gaussian distribution keeps the synthetic data near the line. This keeps the algorithm reasonable, because a position too far from the line might lead to false results. Locations that deviate slightly from the line between minority samples provide the classifier with richer information.
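The steps described in prose above (difference, uniform draw, Gaussian draw, synthetic sample) can be sketched as follows. Since the exact equations are not reproduced in this text, the sketch rests on stated assumptions: a single uniform draw fixes a location on the line, each feature then receives a Gaussian draw centered on that location, and the spread `sigma` is a hypothetical hyper-parameter. It should be read as one plausible reading of the description, not as the authors' formulation.

```python
import random

def gaussian_smote_sample(sample, neighbor, sigma, rng):
    """One synthetic point near (not necessarily on) the segment sample -> neighbor.

    Assumptions (not confirmed by the text): one uniform draw `u` is shared by
    all features, and `sigma` is a hypothetical standard deviation.
    """
    u = rng.random()                     # rough location along the line, in [0, 1)
    synthetic = []
    for s, n in zip(sample, neighbor):
        diff = n - s                     # difference to the selected neighbor
        gap = u * diff                   # number between 0 and the difference
        offset = rng.gauss(gap, sigma)   # Gaussian draw centered on that location
        synthetic.append(s + offset)
    return tuple(synthetic)

rng = random.Random(0)
sample, neighbor = (0.2, 0.4), (0.6, 0.8)   # toy minority sample and neighbor
points = [gaussian_smote_sample(sample, neighbor, sigma=0.05, rng=rng)
          for _ in range(5)]
```

Under these assumptions, setting `sigma` to zero reduces the sketch to plain uniform interpolation, while larger values spread the synthetic points further from the line.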
To verify the proposed Gaussian-based SMOTE method, we first select a benchmark dataset. Figure 3(a) shows the seeds dataset from the UCI Repository [16]. The seeds dataset consists of seven attributes and three classes, and each class contains seventy samples. In this paper, we consider class 1 as the minority class and the others as the majority class. Before applying SMOTE and Gaussian-based SMOTE, we normalized the benchmark dataset to the range from zero to one. As shown in Figure 3(a), the minority class samples are placed in the middle of the data distribution, and the majority class samples surround them.
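The normalization step can be sketched as per-feature min-max scaling. The text only states that the data were normalized from zero to one, so scaling each attribute independently is an assumption, and the function name and toy rows are illustrative.

```python
def min_max_normalize(rows):
    """Scale every feature (column) of `rows` linearly to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 40.0)]  # toy data, two attributes
normalized = min_max_normalize(rows)
```

Scaling all attributes to a common range matters here because both the nearest-neighbor search and the size of any added deviation depend on distances in the feature space.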
The synthetic sample generation results are shown in Figure 3(b) and (c). Figure 3(b) is the result of the SMOTE algorithm, and Figure 3(c) is the result of the Gaussian-based SMOTE algorithm. As shown in the figures, the synthetic samples generated by the SMOTE algorithm appear to fall in overlapping places. On the other hand, the synthetic samples generated by the Gaussian-based SMOTE algorithm appear to be widely distributed around the minority samples. This suggests that the proposed Gaussian-based SMOTE algorithm performs better than the SMOTE algorithm. To compare the performance numerically, we applied the support vector machine [17] as the classification method and compared the results using the accuracy, defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative. To account for the randomness in the algorithms, we conducted the experiments five times and derived the average performance. With the original dataset, the support vector machine achieves an accuracy of 87.30%. With the artificial dataset generated by SMOTE, the accuracy is 89.11%. With the synthetic dataset generated by Gaussian-based SMOTE, the accuracy rises to 90.13%.
We also verify the proposed method on a real-world application related to weather forecasting. Removing non-weather echoes from radar data is essential for obtaining high accuracy and reliability. An anomalous propagation echo is one of the non-weather echoes that appear when a weather radar performs its observation process. It can cause significant false prediction results in quantitative precipitation estimation. Moreover, it occurs rarely and randomly, so detecting it can be regarded as a class imbalance problem.
For comparison, we conducted experiments with the support vector machine using the imbalanced dataset and the datasets balanced by SMOTE and by Gaussian-based SMOTE. Because we implemented a binary classification method, we selected accuracy, sensitivity, and specificity as performance measures, as shown in
The average accuracy using Gaussian-based SMOTE is 87.45%, which is higher than the 83.22% obtained using the imbalanced dataset directly and the 85.66% obtained using SMOTE. The sensitivity and the specificity of Gaussian-based SMOTE are also 3% to 5% higher than those of the others. Therefore, it is confirmed that the proposed method can solve the class imbalance problem better than the SMOTE algorithm.
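For reference, the three measures can be computed from the confusion counts of a binary classifier as in the sketch below. It uses the standard definitions (sensitivity as the true positive rate, specificity as the true negative rate); the function names, the label encoding with 1 as the positive class, and the toy labels are assumptions and are unrelated to the radar data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of positives that were detected
    specificity = tn / (tn + fp)   # share of negatives that were rejected
    return accuracy, sensitivity, specificity

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # toy labels, 1 = minority (positive) class
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
accuracy, sensitivity, specificity = evaluate(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy is useful for imbalanced data, because accuracy alone can stay high even when most minority samples are misclassified.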
Providing a sufficient amount of training data is essential when implementing classifiers. However, training samples from the real world suffer from many problems, including noise and skewed class distribution. The skewed class distribution, also called the class imbalance problem, causes a considerable performance loss because of the underlying assumption of machine learning methods. Therefore, resolving the problem is important for obtaining classifiers with excellent performance. In this paper, we proposed a novel method named Gaussian-based SMOTE by combining the Gaussian probability distribution with the SMOTE algorithm.
Experiments with the benchmark data and the real-world application confirm that the proposed Gaussian-based SMOTE algorithm performs better than the SMOTE algorithm. In future work, we will address several considerations. First, more performance comparison experiments should be conducted with other SMOTE algorithms, including Borderline-SMOTE and safe-level-SMOTE. This will show which algorithm is better from various points of view, such as accuracy and computational time, and will suggest new directions for improving the proposed Gaussian-based SMOTE algorithm. Second, we will implement a method for setting the appropriate hyper-parameters including
This work was supported by the Energy Efficiency & Resources Core Technology Program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20151110200040).
No potential conflict of interest relevant to this article was reported.
Email: hansoo@pusan.ac.kr
Email: wisekim@pusan.ac.kr
E-mail: sskim@pusan.ac.kr
Figure 1. Flowchart of the SMOTE algorithm.
Figure 2. The operation principle of the Gaussian-based SMOTE algorithm.
Figure 3. Benchmark dataset and synthetic samples: (a) given dataset, (b) SMOTE, (c) Gaussian-based SMOTE.