International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(4): 346-357
Published online December 25, 2020
https://doi.org/10.5391/IJFIS.2020.20.4.346
© The Korean Institute of Intelligent Systems
Jiarui Li^{1}, Yukio Horiguchi^{2}, and Tetsuo Sawaragi^{1}
^{1}Department of Mechanical Engineering and Science, Graduate School of Engineering, Kyoto University, Kyoto, Japan
^{2}Faculty of Informatics, Kansai University, Osaka, Japan
Correspondence to :
Jiarui Li (ljr10225008@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Fuzzy C-means (FCM) has a definite limitation when partitioning a dataset into clusters with varying sizes and densities because it ignores the scale difference in different dimensions of input data objects. To alleviate this cluster size insensitivity, we propose a wrapper algorithm for FCM by introducing cluster size as a priori information and limiting the search direction on the basis of density benchmarks (CSCD-FCM). This method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage changes the shape of each cluster while maintaining its center. Both steps modify fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function. Numerical and practical experiments with unbalanced cluster size settings demonstrate the effectiveness of this method for extracting actual cluster structures, as well as achieving the desired cluster populations.
Keywords: Fuzzy C-means, Clustering, Cluster size insensitivity
Clustering is a method for grouping data objects into clusters, such that objects in the same cluster have similar characteristics, whereas those in different clusters have disparate characteristics [1]. However, there is no absolute rule for defining cluster characteristics, which is why different clustering techniques have been proposed on the basis of various inclusion principles [2]. According to Fraley and Raftery [3], clustering methods can be categorized as hard clustering or soft (fuzzy) clustering. Hard clustering methods, such as K-means, strictly allocate each object to only one cluster, so these methods require clear cluster boundaries. However, in real life, data objects may naturally belong to either single or multiple groups, which limits the application of hard clustering [4]. Therefore, fuzzy clustering techniques have been developed [5]. Fuzzy clustering is more flexible because it allows every object to belong to more than one cluster [6]. Fuzzy C-means (FCM) is a widely used partitioning-based fuzzy clustering method that divides a set of data objects into fuzzy partitions by minimizing the intra-cluster variance and maximizing the inter-cluster variance in accordance with an objective function [7].
Because of the flexibility involved with acting on overlapped datasets, FCM and its expansions have been widely used in diverse fields such as image processing and market segmentation [8–10].
However, a common failure of partition-based clustering methods is their inability to deal with datasets of varying cluster size and data density, which we call the “size-insensitivity problem.” As the most commonly used partition-based clustering method, FCM is undoubtedly not immune to this issue [11].
For one thing, as a Euclidean distance-based method, FCM ignores the scale difference in different dimensions of input data objects, resulting in weak discernment of diverse-distribution data structures. To resolve this defect, Mahalanobis distance has been added to the original FCM by many researchers [12–14]. Using the covariance matrix to evaluate the correlations among different dimensions makes such FCM extensions better at extracting data structures with various distributions. However, they sometimes fail when the difference in the distribution is small. In a similar way, Gaussian-based models are also good at distinguishing the distinctions of data distributions [15]. Ichihashi et al. [16] found that when setting a fuzzifier at a particular value, FCM achieves the same effect as Gaussian mixture density decomposition (GMDD) by regularizing the K-L information (KFCM) [17, 18]. KFCM is a partition-based improvement that solves the size-insensitive problem to some degree, but it is limited by the boundary conditions and cannot deal with clusters firmly located next to each other.
Moreover, because FCM uses a sum-of-squared-error objective function to obtain solutions, it tends to drift centers of smaller clusters toward more massive adjacent clusters and to equalize cluster populations. To overcome this problem, the following algorithms have been proposed: fuzzy maximum likelihood estimation (FMLE) [19], semi-supervised FCM (ss-FCM) [20], cluster-size-insensitive FCM (CSI-FCM) [21], and size-insensitive integrity-based FCM (SIIB-FCM) [22]. Lin et al. [22] showed that SIIB-FCM is superior to the other three, so we will not repeat the details here. However, SIIB-FCM is quite sensitive to the distances among clusters, which means it is only good at handling datasets with relatively clear cluster boundaries. This shortcoming is apparent when dealing with practical issues, such as removing small amounts of noise mixed in with the raw signal.
Considering the weaknesses mentioned above, a more effective method is needed to solve the size-insensitivity problem of FCM. In an actual application, one can sometimes obtain background information concerning data objects, such as the expected number of data groups and the approximate size of each group, from preliminary surveys, prior studies, or the opinions of domain experts in actual applications of data clustering. The proportion of each cluster size, as well as the number of clusters, may well be available as a priori knowledge for clustering. Clustering is based on a similarity measure to group semblable data objects together, so it is commonly utilized in areas such as market segmentation, vehicle routing selection, and healthcare problems [23, 24]. On these occasions, users employ certain conditions for various classification purposes. As a result, if such additional information is introduced as a priori information, it will likely help clustering algorithms find the intrinsic structure behind observations.
In our previous studies [25, 26], we developed an algorithm for FCM to assimilate information regarding the desired cluster sizes by refining the cluster membership functions. This method has demonstrated advantages over SIIB-FCM. In this paper, we extend the wrapper algorithm by defining more convincing adjustment directions by introducing the Mahalanobis distance.
The significant contributions of this study are as follows. 1) The present paper proposes a new, improved fuzzy clustering method to solve the size-insensitivity problem. 2) The proposed method is good at extracting data structures, regardless of the cluster numbers or data dimensions, and can deal with multi-distribution datasets. 3) Unlike other similar algorithms, the proposed CSCD-FCM is not sensitive to cluster distance, which widens its application range. 4) The proposed method can correct the error input of the priority information itself, so its application is soft and flexible.
The remainder of this paper is organized as follows. In Section 2, we review the FCM. Section 3 proposes a new algorithm to assimilate a priori knowledge of the cluster size proportions into fuzzy clustering. Section 4 presents experiments to show the advantages of the proposed method compared with other algorithms, and Section 5 discusses the results. Finally, concluding remarks are presented in Section 6.
Given a dataset that contains
where
In contrast, the cluster center
As mentioned in Section 1, FCM attempts to minimize the objective function
The proposed method aims to adjust multi-dimensional fuzzy clusters to attain the desired population sizes. We assume that the desired proportion of cluster sizes is available as a priori knowledge for the clustering problem in question. In this premise, it is hypothesized that as the calculated cluster size approaches the given cluster population, the data structure that will be extracted is more suitable. To this end, the proposed algorithm modifies the fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function, as shown in function
Here,
The method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage adjusts the shape of each cluster while maintaining its center position. For a clear understanding, we present the specific procedures of the adjustment in the next section.
As mentioned above, the proposed method uses two-stage adjustments intended to be applied to fuzzy partitions generated by other soft clustering methods. Both stages aim to optimize function
After initializing the clustering solution, the adjustment procedure for each cluster is followed, and steps 2 to 4 are repeated. The order should be from the largest cluster to the smallest one. After completing the adjustment for one cluster, the data belonging to the adjusted cluster from the input that obtained the highest membership moves to the target cluster and the remainder of the dataset moves to step 2 as the new input for finding the density center of the next cluster. The memberships used for removing data come from the new membership matrix of the adjusted cluster.
For a more fundamental understanding, we use the following example to expound.
As shown in Figure 2, the input dataset contains three clusters with discrepant data populations of 100/1500/50. The following sections describe in detail the steps mentioned above using this example.
Given a dataset that consists of
expresses the cluster prototypes, where
In addition, according to function
The subsequent stages of the proposed method regard the cluster center matrix
The “cluster center” is the arithmetic mean of all the points belonging to the cluster. Each point is closer to its cluster center than to the other cluster centers. The original FCM uses Euclidean distance as the only measurement standard for the objective function, causing it to drive smaller cluster centers closer to the larger ones. This is why an equal tumble occurs. As shown in Figure 4, it is obvious that point x is closer to the center of cluster 3 than cluster 2. Thus, FCM makes an erroneous judgment on data point x by placing it in cluster 2.
The limited consideration of only one standard leads to a wrong answer for clustering. To determine the most suitable positions of cluster centers, we introduce the cluster size
The density reflects the closeness of the data near each other in a certain area, and it can be calculated as
First, a Mahalanobis distance matrix is created using all input objects
In function
where
For each distance vector
Here, what should be given attention is that the position of density prototype
The first stage of adjustment is intended to determine the desired position of each cluster to obtain the target cluster sizes. In this stage, the algorithm explores new positions for both the peak and floor points of each cluster under constraints that limit the search directions.
We first duplicate the cluster center matrix
In Figure 5, the adjusting direction of cluster 2 is from
As the search direction and the interval are specified, the algorithm attempts to minimize objective function
The second stage of adjusting the membership functions changes the shape of each cluster to minimize objective function
Given peak/floor matrixes
As shown in Figure 6, during the shape adjustment procedure, the peak
The result of the two-stage adjustment is a set of updated peak/floor matrixes
At the same time, we combine the
Figure 7 shows the results of
We use numerical and practical experiments to examine the effectiveness of the proposed method in comparison with FCM, SIIB-FCM, and KL-FCM. To test the stronger points of the proposed algorithm, we conducted four experiments. The aims and contents of the four experiments are shown in Table 1.
In this experiment, a numerical dataset is prepared to test the correct clustering power of the proposed method by evaluating the error rate of the data population classified into each cluster and the accuracy of the clustering results. Additionally, for the numerical dataset, we also evaluated the indexes typically used for soft clustering methods: Dunn’s index (DI) and the Xie-Beni index (XB).
The experiment used artificial datasets with different cluster populations. Section 3 uses a three-cluster dataset as an example to explain the algorithm. Here, we created another dataset to test the effectiveness of the proposed algorithm under different data distributions. Figure 8 shows the original distributions of the input dataset.
For this test, there are three cross-test datasets, and the size of the cluster is set as 2000/50. Figure 9 shows the results.
KL-FCM, a segment-based clustering method using the Mahalanobis distance, highlights its advantages when input data have diverse distributions. The proposed algorithm also draws on the advantage of the Mahalonobis distance in the stage of finding the “density center.” Thus, CSCD-FCM is tolerant to multi-distribution data to some degree.
Tables 2 and 3 show the related evaluation indices.
Here, SIIB-FCM shows the lowest XB value on this occasion, which is primarily a result of the reduction of memberships for all the data. However, compared with the accuracy of extracting the exact data structure, the decrease in the XB index does not make sense. Considering the size-insensitivity problem, CSCD-FCM performs best here.
SIIB-FCM and KL-FCM were proposed to improve FCM performance on a clustering dataset with unbalanced cluster sizes and different data distributions. Both methods perform well when clusters are sufficiently far apart, but not when clusters are close together. We call this the “distance-sensitivity problem.” This problem commonly occurs in unsupervised clustering methods, especially when the cluster sizes are unbalanced. The following experiments focus on the distance-sensitivity problem, and the results show that the proposed method has a high tolerance with respect to the compactness of clusters.
In this experiment, different datasets were generated for two circular clusters using normal-distribution data with cluster sizes of 2000/50 and various cluster distances, and a three-dataset validation was conducted for each kind of input. The center distance of the circle measures the distance between clusters.
We used the accuracy and F1_score to evaluate the clustering possibility of the four algorithms under close to far cluster distances. Figures 10 and 11 show the accuracy and F1_scores, respectively, of FCM, SIIB, KL-FCM, and CSCD-FCM with various cluster distances.
The accuracy and F1_score reflect the degree of correct classification. Figures 10 and 11 show that the size-insensitivity problem exists in FCM, regardless of the distances among the clusters. When the interval between the two clusters is sufficiently large, e.g., over 4000, as shown in this example, KL-FCM and SIIB-FCM can obtain better results than FCM. However, when the clusters are closer together, KL-FCM and SIIB-FCM sometimes perform worse than the original FCM.
In contrast, the proposed algorithm has almost no sensitivity to distance. This means that even if the data distribution is very tight, the clustering result is still good. This strong point may make CSCD-FCM more widely applicable than similar algorithms.
This experiment is aimed at testing whether the proposed algorithm can obtain adequate results when the cluster size information contains errors. The datasets used in this experiment were still generated in two circular clusters using normal-distribution data with cluster sizes of 2000/50. We evaluated the accuracy and F1_score by fixing one cluster’s size, which is the correct size information, and reducing or increasing the other clusters’ sizes by a certain percentage of the actual size. Figures 12 and 13 show the results.
The results show that the proposed algorithm is robust to the error input of size information. The algorithm still obtains both accuracy and F1_score greater than 0.95 when the given information reduces the larger cluster to 70% of its actual size and increases the smaller cluster to 800% of its actual size. Thus, the proposed algorithm has a degree of tolerance to the “prior information error.” Considering this, when using the algorithm to deal with a practical problem, the requirements for prior input knowledge regarding size information are not strict. An inevitable ambiguous error is permitted.
In this experiment, we conducted a practical application on a healthcare problem. Obstructive sleep apnea (OSA) is a common sleep disorder disease, especially in elderly individuals [27]. According to the most recent survey from the American Academy of Sleep Medicine (AASM), the sleep health research authority, 71.9% of the general population aged 40–85 years suffers from this disease [27]. Because people are unconscious when sleeping, it is difficult to discover OSA in the primary care stage. A commonly used means for self-diagnosing OSA is the screening tool, and the STOP-Bang questionnaire is the most well-known [28].
There are eight questions in the STOP-Bang questionnaire. We collected 3,931 people’s STOP-Bang data and Apnea-Hypopnea Index (AHI) from the Sleep Heart Health Study dataset [29], where AHI is regarded as the gold standard indicator for diagnosing OSA. The participants’ age ranged from 39 to 95 years. None of these 3,931 people had been diagnosed with OSA by a medical doctor or accepted an OSA treatment before collecting the data. An AHI ≥ 5 is a critical value to judge OSA, and there are 2,730 objects, occupying approximately 70% of the 3,931 people who should be diagnosed with OSA. We applied the algorithm to 8-dimensional input data reflecting the eight questions from the 3,931 objects with the prior knowledge of 2,730 objects for the positive cluster and 1,201 objects for the negative cluster. The clustering results were evaluated by comparing the AHI value judgments. For a healthcare problem, medical doctors are generally more concerned more about the sensitivity of the positive rate. Hence, this index is introduced as another evaluation indicator, in addition to accuracy and F1_score. The clustering results using the four methods are shown in Table 4.
The Cp in Table 4 presents a positive cluster, and Cp size is recorded as the data population classified into the positive cluster (differences from the true cluster size, 2,730). Table 4 clearly shows that with the constraint of the prior knowledge, the difference between the data populations classified into each cluster of the proposed method and the given size is smallest, and the accuracy and sensitivity are also much higher than those of the other methods.
The size-insensitivity problem is common in objective function-based clustering methods such as FCM. FCM clusters data objects by minimizing the sum of Euclidean distance errors among data, which causes data objects at the edge of larger clusters to drag the center of adjacent smaller clusters toward itself. This movement leads to the clustering results of FCM often having balanced sizes.
To solve this size-insensitivity problem, improved methods such as SIIB-FCM and KL-FCM introduce extra information, such as data populations and distribution characteristics, to correct the shortcomings caused by the objective functions based on the Euclidean distance. Nevertheless, the only objective function that these methods attempt to optimize is still the sum of variance errors of all data points, which leads to them being sensitive to the distance between neighboring clusters. This kind of simultaneous use of a variety of information causes different types of information to interfere with each other; thus, the algorithms cannot fully mine all the information carried by the data. The proposed method divides the optimization procedure into two stages: one for variance error optimization, and the other for extra-knowledge optimization. The segmented use of the objective functions helps the proposed algorithm better utilize the necessary information hidden in the data.
Additionally, attaching extra knowledge as clustering data objects may cause unnecessary biases in the original data structure. For one thing, the a priori knowledge utilized by the algorithm is not groundless rumors. It must be the knowledge carried by the input data objects, and the use of this knowledge should not affect the hidden information in the dataset that the algorithm hopes to extract. In addition, the proposed algorithm has some tolerance to errors in the input information of cluster size, which will help reduce the influence of the biases in application cases.
The size-insensitivity problem is one of the significant shortcomings of traditional FCM and its variations, resulting in a weak clustering probability when dealing with unbalanced datasets. Thus, this paper proposes an improved fuzzy clustering method based on the assumption of obtaining “cluster sizes” as a priori information and subjoins a size objective function to take advantage of the information lost in traditional FCM. The algorithm contains a two-stage adjustment. One is the position adjustment, to find the most suitable location for each cluster. The other is a further adjustment that continues to optimize the size objective function by changing the cluster shape. Additionally, utilizing the Mahalanobis distance to define the moving directions during the adjustment procedure enhances the capacity of the algorithm to deal with multi-distribution datasets. Compared with other algorithms aimed at solving the same problem, such as KL-FCM and SIIB-FCM, the proposed method can extract the actual data distribution more correctly and has a high tolerance to cluster distance. Additionally, the proposed CSCD-FCM can offer the correct clustering solution, even when the input information contains errors.
In this study, the proposed method is proven to have the ability to solve clustering problems with different distributions of data to some degree. However, when the cluster shape becomes concave, the algorithm performs unsatisfactorily. In the future, we would like to combine the CSCD-FCM with fuzzy variation strength in dealing with arbitrary shape clusters to increase its application range.
No potential conflict of interest relevant to this article was reported.
Clustering results using the four algorithms: (a) FCM, (b) SIIB-FCM, (c) KL-FCM, and (d) CSCD-FCM.
Table 1. Aims and contents of the four experiments.
Contents | |
---|---|
Experiment 1 | Data structure extraction test |
Experiment 2 | Distance tolerance test |
Experiment 3 | Robustness test |
Experiment 4 | Practical example of a healthcare problem |
Table 2. Cluster size results for two clusters with different distributions.
Method | Cluster 1 | Cluster 2 | ||
---|---|---|---|---|
Size | Difference | Size | Difference | |
FCM | 1177 | 823 | 923 | −823 |
SIIB-FCM | 807 | 1193 | 1293 | −1193 |
KL-FCM | 1582 | 418 | 518 | −418 |
CSCD-FCM |
Table 3. Evaluation indices for two clusters with different distributions.
Method | Accuracy | F1_score | DI | XB |
---|---|---|---|---|
0.6080 | 0.6528 | 0.0043 | 0.1532 | |
0.4321 | 0.6095 | 0.0027 | 0.0428 | |
0.7171 | 0.6922 | 0.0016 | 72.3254 | |
Table 4. Evaluation indices for the practical example.
Method | Accuracy | F1_score | Sensitivity | Cp size |
---|---|---|---|---|
0.3885 | 0.3531 | 0.4608 | 2190(−540) | |
0.5785 | 0.5517 | 0.6150 | 2285(−445) | |
0.4861 | 0.5523 | 0.3795 | 1362(−1368) | |
E-mail: ljr10225008@gmail.com
International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(4): 346-357
Published online December 25, 2020 https://doi.org/10.5391/IJFIS.2020.20.4.346
Copyright © The Korean Institute of Intelligent Systems.
Jiarui Li^{1}, Yukio Horiguchi^{2}, and Tetsuo Sawaragi^{1}
^{1}Department of Mechanical Engineering and Science, Graduate School of Engineering, Kyoto University, Kyoto, Japan
^{2}Faculty of Informatics, Kansai University, Osaka, Japan
Correspondence to:Jiarui Li (ljr10225008@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Fuzzy C-means (FCM) has a definite limitation when partitioning a dataset into clusters with varying sizes and densities because it ignores the scale difference in different dimensions of input data objects. To alleviate this cluster size insensitivity, we propose a wrapper algorithm for FCM by introducing cluster size as a priori information and limiting the search direction on the basis of density benchmarks (CSCD-FCM). This method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage changes the shape of each cluster while maintaining its center. Both steps modify fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function. Numerical and practical experiments with unbalanced cluster size settings demonstrate the effectiveness of this method for extracting actual cluster structures, as well as achieving the desired cluster populations.
Keywords: Fuzzy C-means, Clustering, Cluster size insensitivity
Clustering is a method for grouping data objects into clusters, such that objects in the same cluster have similar characteristics, whereas those in different clusters have disparate characteristics [1]. However, there is no absolute rule for defining cluster characteristics, which is why different clustering techniques have been proposed on the basis of various inclusion principles [2]. According to Fraley and Raftery [3], clustering methods can be categorized as hard clustering or soft (fuzzy) clustering. Hard clustering methods, such as K-means, strictly allocate each object to only one cluster, so these methods require clear cluster boundaries. However, in real life, data objects may naturally belong to either single or multiple groups, which limits the application of hard clustering [4]. Therefore, fuzzy clustering techniques have been developed [5]. Fuzzy clustering is more flexible because it allows every object to belong to more than one cluster [6]. Fuzzy C-means (FCM) is a widely used partitioning-based fuzzy clustering method that divides a set of data objects into fuzzy partitions by minimizing the intra-cluster variance and maximizing the inter-cluster variance in accordance with an objective function [7].
Because of the flexibility involved with acting on overlapped datasets, FCM and its expansions have been widely used in diverse fields such as image processing and market segmentation [8–10].
However, a common failure of partition-based clustering methods is their inability to deal with datasets of varying cluster size and data density, which we call the “size-insensitivity problem.” As the most commonly used partition-based clustering method, FCM is undoubtedly not immune to this issue [11].
For one thing, as a Euclidean distance-based method, FCM ignores the scale difference in different dimensions of input data objects, resulting in weak discernment of diverse-distribution data structures. To resolve this defect, Mahalanobis distance has been added to the original FCM by many researchers [12–14]. Using the covariance matrix to evaluate the correlations among different dimensions makes such FCM extensions better at extracting data structures with various distributions. However, they sometimes fail when the difference in the distribution is small. In a similar way, Gaussian-based models are also good at distinguishing the distinctions of data distributions [15]. Ichihashi et al. [16] found that when setting a fuzzifier at a particular value, FCM achieves the same effect as Gaussian mixture density decomposition (GMDD) by regularizing the K-L information (KFCM) [17, 18]. KFCM is a partition-based improvement that solves the size-insensitive problem to some degree, but it is limited by the boundary conditions and cannot deal with clusters firmly located next to each other.
Moreover, because FCM uses a sum-of-squared-error objective function to obtain solutions, it tends to drift centers of smaller clusters toward more massive adjacent clusters and to equalize cluster populations. To overcome this problem, the following algorithms have been proposed: fuzzy maximum likelihood estimation (FMLE) [19], semi-supervised FCM (ss-FCM) [20], cluster-size-insensitive FCM (CSI-FCM) [21], and size-insensitive integrity-based FCM (SIIB-FCM) [22]. Lin et al. [22] showed that SIIB-FCM is superior to the other three, so we will not repeat the details here. However, SIIB-FCM is quite sensitive to the distances among clusters, which means it is only good at handling datasets with relatively clear cluster boundaries. This shortcoming is apparent when dealing with practical issues, such as removing small amounts of noise mixed in with the raw signal.
Considering the weaknesses mentioned above, a more effective method is needed to solve the size-insensitivity problem of FCM. In an actual application, one can sometimes obtain background information concerning data objects, such as the expected number of data groups and the approximate size of each group, from preliminary surveys, prior studies, or the opinions of domain experts in actual applications of data clustering. The proportion of each cluster size, as well as the number of clusters, may well be available as a priori knowledge for clustering. Clustering is based on a similarity measure to group semblable data objects together, so it is commonly utilized in areas such as market segmentation, vehicle routing selection, and healthcare problems [23, 24]. On these occasions, users employ certain conditions for various classification purposes. As a result, if such additional information is introduced as a priori information, it will likely help clustering algorithms find the intrinsic structure behind observations.
In our previous studies [25, 26], we developed an algorithm for FCM to assimilate information regarding the desired cluster sizes by refining the cluster membership functions. This method has demonstrated advantages over SIIB-FCM. In this paper, we extend the wrapper algorithm by defining more convincing adjustment directions by introducing the Mahalanobis distance.
The significant contributions of this study are as follows. 1) The present paper proposes a new, improved fuzzy clustering method to solve the size-insensitivity problem. 2) The proposed method is good at extracting data structures, regardless of the cluster numbers or data dimensions, and can deal with multi-distribution datasets. 3) Unlike other similar algorithms, the proposed CSCD-FCM is not sensitive to cluster distance, which widens its application range. 4) The proposed method can correct the error input of the priority information itself, so its application is soft and flexible.
The remainder of this paper is organized as follows. In Section 2, we review the FCM. Section 3 proposes a new algorithm to assimilate a priori knowledge of the cluster size proportions into fuzzy clustering. Section 4 presents experiments to show the advantages of the proposed method compared with other algorithms, and Section 5 discusses the results. Finally, concluding remarks are presented in Section 6.
Given a dataset that contains
where
In contrast, the cluster center
As mentioned in Section 1, FCM attempts to minimize the objective function
The proposed method aims to adjust multi-dimensional fuzzy clusters to attain the desired population sizes. We assume that the desired proportion of cluster sizes is available as a priori knowledge for the clustering problem in question. In this premise, it is hypothesized that as the calculated cluster size approaches the given cluster population, the data structure that will be extracted is more suitable. To this end, the proposed algorithm modifies the fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function, as shown in function
Here,
The method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage adjusts the shape of each cluster while maintaining its center position. For a clear understanding, we present the specific procedures of the adjustment in the next section.
As mentioned above, the proposed method uses two-stage adjustments intended to be applied to fuzzy partitions generated by other soft clustering methods. Both stages aim to optimize function
After initializing the clustering solution, the adjustment procedure for each cluster is followed, and steps 2 to 4 are repeated. The order should be from the largest cluster to the smallest one. After completing the adjustment for one cluster, the data belonging to the adjusted cluster from the input that obtained the highest membership moves to the target cluster and the remainder of the dataset moves to step 2 as the new input for finding the density center of the next cluster. The memberships used for removing data come from the new membership matrix of the adjusted cluster.
For a more fundamental understanding, we use the following example to expound.
As shown in Figure 2, the input dataset contains three clusters with discrepant data populations of 100/1500/50. The following sections describe in detail the steps mentioned above using this example.
Given a dataset that consists of
expresses the cluster prototypes, where
In addition, according to function
The subsequent stages of the proposed method regard the cluster center matrix
The “cluster center” is the arithmetic mean of all the points belonging to the cluster. Each point is closer to its cluster center than to the other cluster centers. The original FCM uses Euclidean distance as the only measurement standard for the objective function, causing it to drive smaller cluster centers closer to the larger ones. This is why an equal tumble occurs. As shown in Figure 4, it is obvious that point x is closer to the center of cluster 3 than cluster 2. Thus, FCM makes an erroneous judgment on data point x by placing it in cluster 2.
The limited consideration of only one standard leads to a wrong answer for clustering. To determine the most suitable positions of cluster centers, we introduce the cluster size
The density reflects the closeness of the data near each other in a certain area, and it can be calculated as
First, a Mahalanobis distance matrix is created using all input objects
In function
where
For each distance vector
Here, what should be given attention is that the position of density prototype
The first stage of adjustment is intended to determine the desired position of each cluster to obtain the target cluster sizes. In this stage, the algorithm explores new positions for both the peak and floor points of each cluster under constraints that limit the search directions.
We first duplicate the cluster center matrix
In Figure 5, the adjusting direction of cluster 2 is from
As the search direction and the interval are specified, the algorithm attempts to minimize objective function
The second stage of adjusting the membership functions changes the shape of each cluster to minimize objective function
Given peak/floor matrixes
As shown in Figure 6, during the shape adjustment procedure, the peak
The result of the two-stage adjustment is a set of updated peak/floor matrixes
At the same time, we combine the
Figure 7 shows the results of
We use numerical and practical experiments to examine the effectiveness of the proposed method in comparison with FCM, SIIB-FCM, and KL-FCM. To test the stronger points of the proposed algorithm, we conducted four experiments. The aims and contents of the four experiments are shown in Table 1.
In this experiment, a numerical dataset is prepared to test the correct clustering power of the proposed method by evaluating the error rate of the data population classified into each cluster and the accuracy of the clustering results. Additionally, for the numerical dataset, we also evaluated the indexes typically used for soft clustering methods: Dunn’s index (DI) and the Xie-Beni index (XB).
The experiment used artificial datasets with different cluster populations. Section 3 uses a three-cluster dataset as an example to explain the algorithm. Here, we created another dataset to test the effectiveness of the proposed algorithm under different data distributions. Figure 8 shows the original distributions of the input dataset.
For this test, there are three cross-test datasets, and the size of the cluster is set as 2000/50. Figure 9 shows the results.
KL-FCM, a segment-based clustering method using the Mahalanobis distance, highlights its advantages when input data have diverse distributions. The proposed algorithm also draws on the advantage of the Mahalonobis distance in the stage of finding the “density center.” Thus, CSCD-FCM is tolerant to multi-distribution data to some degree.
Tables 2 and 3 show the related evaluation indices.
Here, SIIB-FCM shows the lowest XB value on this occasion, which is primarily a result of the reduction of memberships for all the data. However, compared with the accuracy of extracting the exact data structure, the decrease in the XB index does not make sense. Considering the size-insensitivity problem, CSCD-FCM performs best here.
SIIB-FCM and KL-FCM were proposed to improve FCM performance on a clustering dataset with unbalanced cluster sizes and different data distributions. Both methods perform well when clusters are sufficiently far apart, but not when clusters are close together. We call this the “distance-sensitivity problem.” This problem commonly occurs in unsupervised clustering methods, especially when the cluster sizes are unbalanced. The following experiments focus on the distance-sensitivity problem, and the results show that the proposed method has a high tolerance with respect to the compactness of clusters.
In this experiment, different datasets were generated for two circular clusters using normal-distribution data with cluster sizes of 2000/50 and various cluster distances, and a three-dataset validation was conducted for each kind of input. The center distance of the circle measures the distance between clusters.
We used the accuracy and F1_score to evaluate the clustering possibility of the four algorithms under close to far cluster distances. Figures 10 and 11 show the accuracy and F1_scores, respectively, of FCM, SIIB, KL-FCM, and CSCD-FCM with various cluster distances.
The accuracy and F1_score reflect the degree of correct classification. Figures 10 and 11 show that the size-insensitivity problem exists in FCM, regardless of the distances among the clusters. When the interval between the two clusters is sufficiently large, e.g., over 4000, as shown in this example, KL-FCM and SIIB-FCM can obtain better results than FCM. However, when the clusters are closer together, KL-FCM and SIIB-FCM sometimes perform worse than the original FCM.
In contrast, the proposed algorithm has almost no sensitivity to distance. This means that even if the data distribution is very tight, the clustering result is still good. This strong point may make CSCD-FCM more widely applicable than similar algorithms.
This experiment is aimed at testing whether the proposed algorithm can obtain adequate results when the cluster size information contains errors. The datasets used in this experiment were still generated in two circular clusters using normal-distribution data with cluster sizes of 2000/50. We evaluated the accuracy and F1_score by fixing one cluster’s size, which is the correct size information, and reducing or increasing the other clusters’ sizes by a certain percentage of the actual size. Figures 12 and 13 show the results.
The results show that the proposed algorithm is robust to the error input of size information. The algorithm still obtains both accuracy and F1_score greater than 0.95 when the given information reduces the larger cluster to 70% of its actual size and increases the smaller cluster to 800% of its actual size. Thus, the proposed algorithm has a degree of tolerance to the “prior information error.” Considering this, when using the algorithm to deal with a practical problem, the requirements for prior input knowledge regarding size information are not strict. An inevitable ambiguous error is permitted.
In this experiment, we conducted a practical application on a healthcare problem. Obstructive sleep apnea (OSA) is a common sleep disorder disease, especially in elderly individuals [27]. According to the most recent survey from the American Academy of Sleep Medicine (AASM), the sleep health research authority, 71.9% of the general population aged 40–85 years suffers from this disease [27]. Because people are unconscious when sleeping, it is difficult to discover OSA in the primary care stage. A commonly used means for self-diagnosing OSA is the screening tool, and the STOP-Bang questionnaire is the most well-known [28].
There are eight questions in the STOP-Bang questionnaire. We collected 3,931 people’s STOP-Bang data and Apnea-Hypopnea Index (AHI) from the Sleep Heart Health Study dataset [29], where AHI is regarded as the gold standard indicator for diagnosing OSA. The participants’ age ranged from 39 to 95 years. None of these 3,931 people had been diagnosed with OSA by a medical doctor or accepted an OSA treatment before collecting the data. An AHI ≥ 5 is a critical value to judge OSA, and there are 2,730 objects, occupying approximately 70% of the 3,931 people who should be diagnosed with OSA. We applied the algorithm to 8-dimensional input data reflecting the eight questions from the 3,931 objects with the prior knowledge of 2,730 objects for the positive cluster and 1,201 objects for the negative cluster. The clustering results were evaluated by comparing the AHI value judgments. For a healthcare problem, medical doctors are generally more concerned more about the sensitivity of the positive rate. Hence, this index is introduced as another evaluation indicator, in addition to accuracy and F1_score. The clustering results using the four methods are shown in Table 4.
The Cp in Table 4 presents a positive cluster, and Cp size is recorded as the data population classified into the positive cluster (differences from the true cluster size, 2,730). Table 4 clearly shows that with the constraint of the prior knowledge, the difference between the data populations classified into each cluster of the proposed method and the given size is smallest, and the accuracy and sensitivity are also much higher than those of the other methods.
The size-insensitivity problem is common in objective function-based clustering methods such as FCM. FCM clusters data objects by minimizing the sum of Euclidean distance errors among data, which causes data objects at the edge of larger clusters to drag the center of adjacent smaller clusters toward itself. This movement leads to the clustering results of FCM often having balanced sizes.
To solve this size-insensitivity problem, improved methods such as SIIB-FCM and KL-FCM introduce extra information, such as data populations and distribution characteristics, to correct the shortcomings caused by the objective functions based on the Euclidean distance. Nevertheless, the only objective function that these methods attempt to optimize is still the sum of variance errors of all data points, which leads to them being sensitive to the distance between neighboring clusters. This kind of simultaneous use of a variety of information causes different types of information to interfere with each other; thus, the algorithms cannot fully mine all the information carried by the data. The proposed method divides the optimization procedure into two stages: one for variance error optimization, and the other for extra-knowledge optimization. The segmented use of the objective functions helps the proposed algorithm better utilize the necessary information hidden in the data.
Additionally, attaching extra knowledge as clustering data objects may cause unnecessary biases in the original data structure. For one thing, the a priori knowledge utilized by the algorithm is not groundless rumors. It must be the knowledge carried by the input data objects, and the use of this knowledge should not affect the hidden information in the dataset that the algorithm hopes to extract. In addition, the proposed algorithm has some tolerance to errors in the input information of cluster size, which will help reduce the influence of the biases in application cases.
The size-insensitivity problem is one of the significant shortcomings of traditional FCM and its variations, resulting in a weak clustering probability when dealing with unbalanced datasets. Thus, this paper proposes an improved fuzzy clustering method based on the assumption of obtaining “cluster sizes” as a priori information and subjoins a size objective function to take advantage of the information lost in traditional FCM. The algorithm contains a two-stage adjustment. One is the position adjustment, to find the most suitable location for each cluster. The other is a further adjustment that continues to optimize the size objective function by changing the cluster shape. Additionally, utilizing the Mahalanobis distance to define the moving directions during the adjustment procedure enhances the capacity of the algorithm to deal with multi-distribution datasets. Compared with other algorithms aimed at solving the same problem, such as KL-FCM and SIIB-FCM, the proposed method can extract the actual data distribution more correctly and has a high tolerance to cluster distance. Additionally, the proposed CSCD-FCM can offer the correct clustering solution, even when the input information contains errors.
In this study, the proposed method is proven to have the ability to solve clustering problems with different distributions of data to some degree. However, when the cluster shape becomes concave, the algorithm performs unsatisfactorily. In the future, we would like to combine the CSCD-FCM with fuzzy variation strength in dealing with arbitrary shape clusters to increase its application range.
Overall flow of the algorithm.
Data distribution of the input dataset.
FCM results and peak and floor of cluster 2.
Density center of cluster 2.
Cluster position adjustment of cluster 2.
Cluster shape adjustment of cluster 2.
Final results.
Numerical dataset with two clusters and different data distributions.
Clustering results using the four algorithms: (a) FCM, (b) SIIB-FCM, (c) KL-FCM, and (d) CSCD-FCM.
Comparison of the accuracy of four algorithms with various distances.
Comparison of the F1_score of four algorithms with various distances.
Changes in indices as cluster size decreases.
Changes in indices as cluster size increases.
Table 1 . Aims and contents of the four experiments.
Contents | |
---|---|
Experiment 1 | Data structure extraction test |
Experiment 2 | Distance tolerance test |
Experiment 3 | Robustness test |
Experiment 4 | Practical example of a healthcare problem |
Table 2 . Cluster size results for two clusters with different distributions.
Method | Cluster 1 | Cluster 2 | ||
---|---|---|---|---|
Size | Difference | Size | Difference | |
FCM | 1177 | 823 | 923 | −823 |
SIIB-FCM | 807 | 1193 | 1293 | −1193 |
KL-FCM | 1582 | 418 | 518 | −418 |
CSCD-FCM |
Table 3 . Evaluation indices for two clusters with different distributions.
Method | Accuracy | F1_score | DI | XB |
---|---|---|---|---|
0.6080 | 0.6528 | 0.0043 | 0.1532 | |
0.4321 | 0.6095 | 0.0027 | 0.0428 | |
0.7171 | 0.6922 | 0.0016 | 72.3254 | |
Table 4 . Evaluation indices for the practical example.
Method | Accuracy | F1_score | Sensitivity | Cp size |
---|---|---|---|---|
0.3885 | 0.3531 | 0.4608 | 2190(−540) | |
0.5785 | 0.5517 | 0.6150 | 2285(−445) | |
0.4861 | 0.5523 | 0.3795 | 1362(−1368) | |
Wiharto, Aditya K. Wicaksana, and Denis E. Cahyani
International Journal of Fuzzy Logic and Intelligent Systems 2021; 21(2): 189-203 https://doi.org/10.5391/IJFIS.2021.21.2.189Sunjin Yu and Changyong Yoon
International Journal of Fuzzy Logic and Intelligent Systems 2019; 19(2): 97-102 https://doi.org/10.5391/IJFIS.2019.19.2.97Maslina Zolkepli,Fangyan Dong,Kaoru Hirota
International Journal of Fuzzy Logic and Intelligent Systems 2014; 14(4): 256-267 https://doi.org/10.5391/IJFIS.2014.14.4.256Overall flow of the algorithm.
|@|~(^,^)~|@|Data distribution of the input dataset.
|@|~(^,^)~|@|FCM results and peak and floor of cluster 2.
|@|~(^,^)~|@|Density center of cluster 2.
|@|~(^,^)~|@|Cluster position adjustment of cluster 2.
|@|~(^,^)~|@|Cluster shape adjustment of cluster 2.
|@|~(^,^)~|@|Final results.
|@|~(^,^)~|@|Numerical dataset with two clusters and different data distributions.
|@|~(^,^)~|@|Clustering results using the four algorithms: (a) FCM, (b) SIIB-FCM, (c) KL-FCM, and (d) CSCD-FCM.
|@|~(^,^)~|@|Comparison of the accuracy of four algorithms with various distances.
|@|~(^,^)~|@|Comparison of the F1_score of four algorithms with various distances.
|@|~(^,^)~|@|Changes in indices as cluster size decreases.
|@|~(^,^)~|@|Changes in indices as cluster size increases.