
Original Article


International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(4): 346-357

Published online December 25, 2020

https://doi.org/10.5391/IJFIS.2020.20.4.346

© The Korean Institute of Intelligent Systems

Cluster Size-Constrained Fuzzy C-Means with Density Center Searching

Jiarui Li1, Yukio Horiguchi2, and Tetsuo Sawaragi1

1Department of Mechanical Engineering and Science, Graduate School of Engineering, Kyoto University, Kyoto, Japan
2Faculty of Informatics, Kansai University, Osaka, Japan

Correspondence to: Jiarui Li (ljr10225008@gmail.com)

Received: June 18, 2020; Revised: December 7, 2020; Accepted: December 15, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Fuzzy C-means (FCM) has a definite limitation when partitioning a dataset into clusters with varying sizes and densities because it ignores the scale difference in different dimensions of input data objects. To alleviate this cluster size insensitivity, we propose a wrapper algorithm for FCM by introducing cluster size as a priori information and limiting the search direction on the basis of density benchmarks (CSCD-FCM). This method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage changes the shape of each cluster while maintaining its center. Both steps modify fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function. Numerical and practical experiments with unbalanced cluster size settings demonstrate the effectiveness of this method for extracting actual cluster structures, as well as achieving the desired cluster populations.

Keywords: Fuzzy C-means, Clustering, Cluster size insensitivity

1. Introduction

Clustering is a method for grouping data objects into clusters, such that objects in the same cluster have similar characteristics, whereas those in different clusters have disparate characteristics [1]. However, there is no absolute rule for defining cluster characteristics, which is why different clustering techniques have been proposed on the basis of various inclusion principles [2]. According to Fraley and Raftery [3], clustering methods can be categorized as hard clustering or soft (fuzzy) clustering. Hard clustering methods, such as K-means, strictly allocate each object to only one cluster, so these methods require clear cluster boundaries. However, in real life, data objects may naturally belong to either single or multiple groups, which limits the application of hard clustering [4]. Therefore, fuzzy clustering techniques have been developed [5]. Fuzzy clustering is more flexible because it allows every object to belong to more than one cluster [6]. Fuzzy C-means (FCM) is a widely used partitioning-based fuzzy clustering method that divides a set of data objects into fuzzy partitions by minimizing the intra-cluster variance and maximizing the inter-cluster variance in accordance with an objective function [7].

Because of this flexibility in handling overlapping datasets, FCM and its extensions have been widely used in diverse fields such as image processing and market segmentation [8–10].

However, a common failure of partition-based clustering methods is their inability to deal with datasets of varying cluster size and data density, which we call the “size-insensitivity problem.” As the most commonly used partition-based clustering method, FCM is undoubtedly not immune to this issue [11].

For one thing, as a Euclidean distance-based method, FCM ignores the scale difference in different dimensions of the input data objects, resulting in weak discernment of data structures with diverse distributions. To resolve this defect, many researchers have added the Mahalanobis distance to the original FCM [12–14]. Using the covariance matrix to evaluate the correlations among different dimensions makes such FCM extensions better at extracting data structures with various distributions. However, they sometimes fail when the difference in the distributions is small. In a similar way, Gaussian-based models are also good at distinguishing differences in data distributions [15]. Ichihashi et al. [16] found that, with the fuzzifier set to a particular value, FCM regularized by the K-L information (KFCM, referred to as KL-FCM in the experiments below) achieves the same effect as Gaussian mixture density decomposition (GMDD) [17, 18]. KFCM is a partition-based improvement that solves the size-insensitivity problem to some degree, but it is limited by the boundary conditions and cannot deal with clusters located tightly next to each other.

Moreover, because FCM uses a sum-of-squared-error objective function to obtain solutions, it tends to drag the centers of smaller clusters toward larger adjacent clusters and to equalize cluster populations. To overcome this problem, the following algorithms have been proposed: fuzzy maximum likelihood estimation (FMLE) [19], semi-supervised FCM (ss-FCM) [20], cluster-size-insensitive FCM (CSI-FCM) [21], and size-insensitive integrity-based FCM (SIIB-FCM) [22]. Lin et al. [22] showed that SIIB-FCM is superior to the other three, so we will not repeat the details here. However, SIIB-FCM is quite sensitive to the distances among clusters, which means it only handles datasets with relatively clear cluster boundaries well. This shortcoming is apparent in practical tasks such as removing small amounts of noise mixed in with a raw signal.

Considering the weaknesses mentioned above, a more effective method is needed to solve the size-insensitivity problem of FCM. In actual applications of data clustering, one can sometimes obtain background information concerning the data objects, such as the expected number of data groups and the approximate size of each group, from preliminary surveys, prior studies, or the opinions of domain experts. The proportion of each cluster size, as well as the number of clusters, may therefore be available as a priori knowledge for clustering. Because clustering groups similar data objects together on the basis of a similarity measure, it is commonly utilized in areas such as market segmentation, vehicle routing selection, and healthcare problems [23, 24]. On these occasions, users impose certain conditions for various classification purposes. As a result, if such additional information is introduced as a priori information, it will likely help clustering algorithms find the intrinsic structure behind the observations.

In our previous studies [25, 26], we developed a wrapper algorithm for FCM that assimilates information regarding the desired cluster sizes by refining the cluster membership functions. This method has demonstrated advantages over SIIB-FCM. In this paper, we extend the wrapper algorithm by introducing the Mahalanobis distance to define more reliable adjustment directions.

The significant contributions of this study are as follows. 1) The present paper proposes a new, improved fuzzy clustering method to solve the size-insensitivity problem. 2) The proposed method is good at extracting data structures, regardless of the number of clusters or data dimensions, and can deal with multi-distribution datasets. 3) Unlike other similar algorithms, the proposed CSCD-FCM is not sensitive to cluster distance, which widens its application range. 4) The proposed method can tolerate and correct errors in the a priori size information itself, so its application is soft and flexible.

The remainder of this paper is organized as follows. In Section 2, we review the FCM. Section 3 proposes a new algorithm to assimilate a priori knowledge of the cluster size proportions into fuzzy clustering. Section 4 presents experiments to show the advantages of the proposed method compared with other algorithms, and Section 5 discusses the results. Finally, concluding remarks are presented in Section 6.

2. An Overview of Fuzzy C-Means

Given a dataset that contains n objects, X = {x1, x2, · · ·, xn}, the FCM algorithm attempts to minimize the following objective function:

$$J = \sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^{m} \left\| x_j - v_i \right\|^{2}, \tag{1}$$

where c is the number of clusters, vi is the centroid of the ith cluster, uij is the degree of membership of xj in the ith cluster, and m is the fuzzifier exponent (m ∈ ℝ and m > 1); larger values of m produce fuzzier clusters. For every data object, the boundary conditions require $\sum_{i=1}^{c} u_{ij} = 1$ and uij ∈ [0, 1]. A solution of objective function (1) can be obtained by an iterative process that alternately updates the memberships and the cluster centers. The membership uij is obtained from a given set of cluster centers by

$$u_{ij} = 1 \Big/ \sum_{k=1}^{c} \left( \frac{\| x_j - v_i \|}{\| x_j - v_k \|} \right)^{\frac{2}{m-1}}. \tag{2}$$

In contrast, the cluster center vi is calculated from a given set of memberships by

$$v_i = \sum_{j=1}^{n} u_{ij}^{m} x_j \Big/ \sum_{j=1}^{n} u_{ij}^{m}. \tag{3}$$
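
A minimal NumPy sketch of the alternating updates (2) and (3) may help make the iteration concrete. The function name fcm, the random center initialization, and the fixed iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    """Sketch of FCM: alternately update memberships U (c x n) and centers V (c x p)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = X[rng.choice(n, size=c, replace=False)]          # random initial centers
    for _ in range(n_iter):
        # d[i, j] = ||x_j - v_i||, shape (c, n); small epsilon avoids division by zero
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # Eq. (2): u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)
        # Eq. (3): centers are the u_ij^m-weighted means of the data
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V
```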

As mentioned in Section 1, FCM attempts to minimize objective function (1) and thus tends to give data objects a lower membership grade for clusters with higher density. As a result, smaller clusters may acquire more objects than much larger clusters, which equalizes the cluster sizes and may misclassify data objects. SIIB-FCM aims to solve this problem.

3. Cluster Size-Constrained Fuzzy C-Means with Density Center Searching (CSCD-FCM)

3.1 “Size-Constrained” Objective Function

The proposed method aims to adjust multi-dimensional fuzzy clusters to attain the desired population sizes. We assume that the desired proportion of cluster sizes is available as a priori knowledge for the clustering problem in question. Under this premise, we hypothesize that the closer the calculated cluster sizes come to the given cluster populations, the more suitable the extracted data structure is. To this end, the proposed algorithm modifies the fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function, as shown in function (4):

$$J_{s_i} = \left| \sum_{j=1}^{n} u_{ij} - S_i \right|. \tag{4}$$

Here, Si represents the target size of cluster ci, constrained by $\sum_{i=1}^{c} S_i = n$, where n is the total number of data objects. uij is the fuzzy membership of the data object xj in ci, and $\sum_{j=1}^{n} u_{ij}$ gives the calculated size of the fuzzy cluster ci. $J_{s_i}$ evaluates the difference between the generated cluster population and the given cluster size Si for the ith cluster.
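
As a small illustration, the size-constrained objective (4) can be evaluated directly from a membership matrix; the sketch below assumes U is a c × n NumPy array and S is the length-c vector of target sizes (both names are ours).

```python
import numpy as np

def size_objective(U, S):
    """Eq. (4): J_si = |sum_j u_ij - S_i| for every cluster i."""
    return np.abs(U.sum(axis=1) - np.asarray(S, dtype=float))
```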

The method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage adjusts the shape of each cluster while maintaining its center position. For a clear understanding, we present the specific procedures of the adjustment in the next section.

3.2 Procedure of CSCD-FCM

As mentioned above, the proposed method uses two-stage adjustments intended to be applied to fuzzy partitions generated by other soft clustering methods. Both stages aim to optimize function (4). Figure 1 shows the entire flow of the algorithm.

After initializing the clustering solution, steps 2 to 4 are repeated for each cluster in turn, proceeding from the largest cluster to the smallest. After the adjustment of one cluster is complete, the data objects whose highest membership lies in the adjusted cluster are assigned to it and removed from the input; the remainder of the dataset returns to step 2 as the new input for finding the density center of the next cluster. The memberships used for removing data come from the new membership matrix of the adjusted cluster.

For a more concrete understanding, we use the following example throughout the rest of this section.

As shown in Figure 2, the input dataset contains three clusters with unbalanced data populations of 100/1500/50. The following sections describe the steps mentioned above in detail using this example.

3.2.1 Cluster prototype generation

Given a dataset that consists of n objects with p dimensions X = {x1, x2, · · ·, xn}, soft clustering methods can generate fuzzy partitions of the dataset. In the case of the FCM algorithm, a p × c matrix

$$V = ( v_1 \; v_2 \; \cdots \; v_c ) = \begin{pmatrix} v_{11} & v_{21} & \cdots & v_{c1} \\ v_{12} & v_{22} & \cdots & v_{c2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1p} & v_{2p} & \cdots & v_{cp} \end{pmatrix} \tag{5}$$

expresses the cluster prototypes, where c is the number of clusters and vi is a p-dimensional vector that represents the centroid of the ith cluster ci. With the cluster center matrix V, we can calculate the membership degree of each object xj in each cluster ci using function (2). In this paper, we assume that m is two when no domain knowledge is available.

In addition, according to function (2), every cluster ci has the highest value of membership, that is, uij = 1, at its own center, but the lowest value, that is, uij = 0, at the centers of the other clusters. For convenience in explaining the following processes, we name these positions the cluster peak and floor, respectively, for each cluster. As shown in Figure 3, FCM classifies the data points into three clusters with similar populations, and v1, v2, and v3 make up the cluster center matrix. Using cluster 2 as an example, the position of v2 is the peak of cluster 2, while the positions of v1 and v3 are its floors; these are the inputs to the following steps.

The subsequent stages of the proposed method regard the cluster center matrix V generated by FCM-like clustering algorithms as the initial solution.

3.2.2 Density prototypes

The “cluster center” is the arithmetic mean of all the points belonging to the cluster, and each point is closer to its own cluster center than to the other cluster centers. The original FCM uses the Euclidean distance as the only measurement standard in the objective function, which drives the centers of smaller clusters closer to the larger ones. This is why the cluster populations tend to equalize. As shown in Figure 4, point x is obviously closer to the center of cluster 3 than to that of cluster 2; thus, FCM makes an erroneous judgment on data point x, which actually belongs to cluster 2.

Considering only this single criterion leads to incorrect clustering results. To determine the most suitable positions of the cluster centers, we introduce the cluster size Si as another reference index. Under the constraint of the given “prior knowledge,” we define a new concept, the density prototype pi, for each cluster.

The density reflects how closely the data are packed together in a certain area, and it can be calculated as Sizei/Volumei, which expresses the number of data objects per unit volume. The cluster size Si is given, so as the value of Volumei decreases, the density increases. We use the sum of the distances from a data point to its Si closest neighbor points as Volumei. Furthermore, the density center pi should be at the densest position of the ith cluster; in other words, Volumei centered on pi should be the smallest. The density prototype of cluster i is generated by the following steps.

First, a Mahalanobis distance matrix is created using all input objects

$$D = \left( dist_1 \; dist_2 \; \cdots \; dist_n \right) = \begin{pmatrix} 0 & d_{21} & \cdots & d_{n1} \\ d_{12} & 0 & \cdots & d_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ d_{1n} & d_{2n} & \cdots & 0 \end{pmatrix}, \quad dist_j = ( d_{j1}, d_{j2}, \ldots, d_{jn} )^{T}. \tag{6}$$

In function (6), each n-dimensional vector distj contains the distances between the data object xj and all the other objects. For completeness, djj, the distance from an object to itself, is also included in the matrix and has a value of 0. The entries dkj are Mahalanobis distances, as shown in function (7):

$$d_{kj} = \sqrt{ (x_k - x_j)^{T} S^{-1} (x_k - x_j) }, \tag{7}$$

where S is the covariance matrix of the input data objects.

For each distance vector distj, we find the Si smallest components, which correspond to the Si neighbor data objects nearest to xj. We then sum these Si components of distj and record the sum as Volumej. Furthermore, we determine the density prototype pi = xp for the ith cluster by seeking the data object xp that has the minimum value of Volumej, j = 1, 2, . . ., n. As mentioned above, pi is considered the density prototype because, compared with other data objects, the Si data objects around it are always the most compact, regardless of the numerical values or distribution features. Therefore, pi is an appropriate benchmark for defining the search direction.
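
A compact sketch of this density prototype search is shown below. It assumes that the covariance matrix S in function (7) is estimated from the current input objects, that Si + 1 ≤ n, and it accepts the O(n²) cost of the full pairwise distance matrix; the function name and these choices are illustrative, not the authors' code.

```python
import numpy as np

def find_density_prototype(X, S_i):
    """Return the object whose S_i nearest neighbors (Mahalanobis distance) are most compact."""
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))      # inverse covariance of the input
    diff = X[:, None, :] - X[None, :, :]                 # pairwise differences, shape (n, n, p)
    quad = np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)
    D = np.sqrt(np.clip(quad, 0.0, None))                # Eq. (7), assembled as the matrix (6)
    # Volume_j: sum of the distances from x_j to its S_i nearest neighbors (self excluded)
    volume = np.sort(D, axis=1)[:, 1:S_i + 1].sum(axis=1)
    return X[np.argmin(volume)]                          # densest object = density prototype p_i
```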

Note that the density prototype pi is not the same as the cluster center. The density prototype pi is used as a benchmark in the next two steps, which guide the cluster centers toward a high-density area. Hence, the adjusted cluster center lies on the line of the moving direction but may not coincide with the density prototype. As shown in Figure 4, the cross marks represent some typical data points in cluster 2, and we assume that all the other data points are distributed uniformly. The “cluster center” position of cluster 2 given by the proposed method will therefore not be exactly p2, nor will it be biased toward data point x.

3.2.3 Cluster position adjustment

The first stage of adjustment is intended to determine the desired position of each cluster to obtain the target cluster sizes. In this stage, the algorithm explores new positions for both the peak and floor points of each cluster under constraints that limit the search directions.

We first duplicate the cluster center matrix V for each cluster; let us define cluster ci’s duplicate of the matrix V as V(i). The ith column of V(i) represents the peak of the cluster ci, and the other columns represent its floor points. Meanwhile, the search direction of cluster ci is constrained to point from the peak toward pi. During the search, the cluster peak and floor positions translate together in the same direction. The distance between the peak and pi determines the search interval. In our experience, the search distance should be more than 1.5 times the peak-to-pi distance to provide a sufficiently broad search range; the exact value depends on the fuzziness of the dataset, and fuzzier data require a shorter search distance. However, the multiple does not need to be set above 3, which only wastes calculation time. We recommend a multiple of 2 when there is no domain knowledge. Figure 5 shows the adjustment procedure for cluster 2 in this step.

In Figure 5, the adjustment direction of cluster 2 points from v2 to p2, and the search distance is double the distance between v2 and p2. The peak and floors are moved together in the same direction and over the same search distance.

Once the search direction and the interval are specified, the algorithm attempts to minimize objective function (4) for the chosen cluster. The search updates the peak/floor matrix V(i) of the cluster. As shown in Figure 5, the black points are the results of this procedure for cluster 2.
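
The sketch below illustrates this first-stage position adjustment, assuming m = 2, a peak/floor matrix stored with one row per point, and a simple grid line search over the recommended interval of twice the peak-to-pi distance; the function names and the number of grid steps are our assumptions, and the paper does not specify which optimizer it uses to minimize (4).

```python
import numpy as np

def memberships(X, centers, m=2.0):
    """Eq. (2) applied to a peak/floor matrix: centers is (c, p), X is (n, p)."""
    d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
    return 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)

def adjust_position(V_i, i, p_i, X, S_i, multiple=2.0, steps=200):
    """Translate peak and floors of cluster i together along (p_i - peak),
    keeping the shift that minimizes the size objective (4)."""
    direction = p_i - V_i[i]                      # search direction: peak -> density prototype
    best_V, best_J = V_i, np.inf
    for t in np.linspace(0.0, multiple, steps):   # search interval: up to `multiple` x |p_i - peak|
        cand = V_i + t * direction                # every peak/floor point moves together
        U = memberships(X, cand)
        J = abs(U[i].sum() - S_i)                 # objective (4) for cluster i
        if J < best_J:
            best_J, best_V = J, cand
    return best_V
```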

3.2.4 Cluster shape adjustment

The second stage of adjusting the membership functions changes the shape of each cluster to minimize objective function (4) by moving the clusters’ floor positions while fixing their peak positions.

Given the peak/floor matrices V(i) obtained from the first stage, the algorithm searches for a better floor point placement for each cluster ci. During the search, all the floor points of the target cluster are translated together. The search direction and the interval are the same as in the first stage.

As shown in Figure 6, during the shape adjustment procedure, the peak v2(2) of cluster 2 is kept fixed, and the floors v1(2) and v3(2) are adjusted in the same direction as in the position adjustment step.
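
A companion sketch of this second-stage shape adjustment follows: the same grid line search as above, except that only the floor rows are translated while the peak row stays fixed. As before, m = 2, the grid search, and the function name are illustrative assumptions.

```python
import numpy as np

def adjust_shape(V_i, i, p_i, X, S_i, m=2.0, multiple=2.0, steps=200):
    """Move only the floors of cluster i along (p_i - peak) to minimize objective (4)."""
    direction = p_i - V_i[i]
    floors = np.arange(len(V_i)) != i             # every row except the peak
    best_V, best_J = V_i, np.inf
    for t in np.linspace(0.0, multiple, steps):
        cand = V_i.copy()
        cand[floors] += t * direction             # translate floors only; the peak is fixed
        d = np.linalg.norm(X[None, :, :] - cand[:, None, :], axis=2) + 1e-12
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)  # Eq. (2)
        J = abs(U[i].sum() - S_i)                 # objective (4) for cluster i
        if J < best_J:
            best_J, best_V = J, cand
    return best_V
```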

3.2.5 Generate new membership functions

The result of the two-stage adjustment is a set of updated peak/floor matrices V(i) for all clusters. By applying function (2), we can obtain a different membership matrix U(i) = (uij), 1 ≤ i ≤ c, 1 ≤ j ≤ n, from each peak/floor matrix V(i). The proposed method then creates a new membership matrix by stacking the ith row of each U(i) one over the other, as shown below:

$$U = \begin{pmatrix} U_1^{(1)} \\ U_2^{(2)} \\ \vdots \\ U_c^{(c)} \end{pmatrix} = \begin{pmatrix} u_{11}^{(1)} & u_{12}^{(1)} & \cdots & u_{1n}^{(1)} \\ u_{21}^{(2)} & u_{22}^{(2)} & \cdots & u_{2n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ u_{c1}^{(c)} & u_{c2}^{(c)} & \cdots & u_{cn}^{(c)} \end{pmatrix}. \tag{8}$$

At the same time, we combine the peak vectors of each V(i) as the new cluster center matrix:

$$V_{\mathrm{new}} = \left( v_1^{(1)} \; v_2^{(2)} \; \cdots \; v_c^{(c)} \right). \tag{9}$$

Figure 7 shows the resulting V_new and the final classification produced by the proposed algorithm for the example dataset.
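
Assembling the final solution then amounts to stacking the ith row of each U(i) and collecting the peaks, as in functions (8) and (9). The sketch below assumes the per-cluster results are held in lists U_list and V_list (our names), with U_list[i] of shape (c, n) and V_list[i] of shape (c, p).

```python
import numpy as np

def combine(U_list, V_list):
    """Build the net membership matrix (8) and the new center matrix (9)."""
    U_new = np.vstack([U_list[i][i] for i in range(len(U_list))])   # ith row of each U(i)
    V_new = np.vstack([V_list[i][i] for i in range(len(V_list))])   # peak of each V(i)
    return U_new, V_new
```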

4. Experiment

We use numerical and practical experiments to examine the effectiveness of the proposed method in comparison with FCM, SIIB-FCM, and KL-FCM. To test the stronger points of the proposed algorithm, we conducted four experiments. The aims and contents of the four experiments are shown in Table 1.

4.1 Capacity for Correctly Extracting Data Structures

In this experiment, a numerical dataset is prepared to test the clustering accuracy of the proposed method by evaluating the error in the data population classified into each cluster and the accuracy of the clustering results. Additionally, for the numerical dataset, we also evaluated the indices typically used for soft clustering methods: Dunn’s index (DI) and the Xie-Beni index (XB).

The experiment used artificial datasets with different cluster populations. Section 3 uses a three-cluster dataset as an example to explain the algorithm. Here, we created another dataset to test the effectiveness of the proposed algorithm under different data distributions. Figure 8 shows the original distributions of the input dataset.

For this test, three test datasets were generated, and the cluster sizes were set to 2000/50. Figure 9 shows the results.

KL-FCM, a partition-based clustering method using the Mahalanobis distance, shows its advantages when the input data have diverse distributions. The proposed algorithm also draws on the advantage of the Mahalanobis distance in the stage of finding the “density center.” Thus, CSCD-FCM is tolerant to multi-distribution data to some degree.

Tables 2 and 3 show the related evaluation indices.

Here, SIIB-FCM shows the lowest XB value, which is primarily a result of the reduction of the memberships of all the data. However, compared with the accuracy of extracting the exact data structure, this decrease in the XB index is not meaningful. Considering the size-insensitivity problem, CSCD-FCM performs best here.

4.2 Clustering Capacity with Varying Distances among Clusters

SIIB-FCM and KL-FCM were proposed to improve FCM performance on a clustering dataset with unbalanced cluster sizes and different data distributions. Both methods perform well when clusters are sufficiently far apart, but not when clusters are close together. We call this the “distance-sensitivity problem.” This problem commonly occurs in unsupervised clustering methods, especially when the cluster sizes are unbalanced. The following experiments focus on the distance-sensitivity problem, and the results show that the proposed method has a high tolerance with respect to the compactness of clusters.

In this experiment, different datasets were generated for two circular clusters using normal-distribution data with cluster sizes of 2000/50 and various cluster distances, and a three-dataset validation was conducted for each kind of input. The center distance of the circle measures the distance between clusters.

We used the accuracy and F1_score to evaluate the clustering performance of the four algorithms as the cluster distance varies from close to far. Figures 10 and 11 show the accuracy and F1_scores, respectively, of FCM, SIIB-FCM, KL-FCM, and CSCD-FCM with various cluster distances.
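
As a point of reference, the sketch below shows one plausible way to score a two-cluster result against ground-truth labels: because cluster indices are arbitrary, the predicted labels are matched to the true labels before accuracy and F1_score are computed. The 0/1 labeling convention and this matching step are our assumptions, as the paper does not describe its exact scoring procedure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def score_two_clusters(y_true, y_pred):
    """Assumes both label arrays use {0, 1}; tries both cluster-to-class mappings
    and keeps the better one before reporting accuracy and F1_score."""
    y_pred = np.asarray(y_pred)
    flipped = 1 - y_pred
    best = y_pred if accuracy_score(y_true, y_pred) >= accuracy_score(y_true, flipped) else flipped
    return accuracy_score(y_true, best), f1_score(y_true, best)
```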

The accuracy and F1_score reflect the degree of correct classification. Figures 10 and 11 show that the size-insensitivity problem exists in FCM, regardless of the distances among the clusters. When the interval between the two clusters is sufficiently large, e.g., over 4000, as shown in this example, KL-FCM and SIIB-FCM can obtain better results than FCM. However, when the clusters are closer together, KL-FCM and SIIB-FCM sometimes perform worse than the original FCM.

In contrast, the proposed algorithm has almost no sensitivity to distance. This means that even if the data distribution is very tight, the clustering result is still good. This strong point may make CSCD-FCM more widely applicable than similar algorithms.

4.3 Robustness of the Algorithm

This experiment is aimed at testing whether the proposed algorithm can obtain adequate results when the cluster size information contains errors. The datasets used in this experiment were again generated as two circular clusters of normal-distribution data with cluster sizes of 2000/50. We evaluated the accuracy and F1_score by fixing one cluster’s size at the correct value and reducing or increasing the other cluster’s size by a certain percentage of its actual size. Figures 12 and 13 show the results.

The results show that the proposed algorithm is robust to errors in the input size information. The algorithm still obtains both an accuracy and an F1_score greater than 0.95 when the given information reduces the larger cluster to 70% of its actual size or increases the smaller cluster to 800% of its actual size. Thus, the proposed algorithm has a degree of tolerance to “prior information error.” Consequently, when using the algorithm to deal with a practical problem, the requirements for the prior knowledge of the size information are not strict, and a certain amount of vagueness or error is permitted.

4.4 A Practical Example

In this experiment, we conducted a practical application on a healthcare problem. Obstructive sleep apnea (OSA) is a common sleep disorder, especially in elderly individuals [27]. According to the most recent survey from the American Academy of Sleep Medicine (AASM), the sleep health research authority, 71.9% of the general population aged 40–85 years suffers from this disease [27]. Because people are unconscious when sleeping, it is difficult to discover OSA at the primary care stage. A commonly used means of self-diagnosing OSA is a screening tool, and the STOP-Bang questionnaire is the most well-known [28].

There are eight questions in the STOP-Bang questionnaire. We collected 3,931 people’s STOP-Bang data and Apnea-Hypopnea Index (AHI) values from the Sleep Heart Health Study dataset [29], where the AHI is regarded as the gold-standard indicator for diagnosing OSA. The participants’ ages ranged from 39 to 95 years. None of these 3,931 people had been diagnosed with OSA by a medical doctor or had received OSA treatment before the data were collected. An AHI ≥ 5 is the critical value for judging OSA, and 2,730 objects, approximately 70% of the 3,931 people, should be diagnosed with OSA. We applied the algorithm to 8-dimensional input data reflecting the eight questions from the 3,931 objects, with the prior knowledge of 2,730 objects for the positive cluster and 1,201 objects for the negative cluster. The clustering results were evaluated by comparison with the AHI-based judgments. For a healthcare problem, medical doctors are generally more concerned about the sensitivity (true-positive rate). Hence, this index is introduced as another evaluation indicator, in addition to accuracy and F1_score. The clustering results of the four methods are shown in Table 4.

Cp in Table 4 denotes the positive cluster, and Cp size is the data population classified into the positive cluster (the difference from the true cluster size, 2,730, is given in parentheses). Table 4 clearly shows that, with the constraint of the prior knowledge, the proposed method yields the smallest difference between the data population classified into each cluster and the given size, and its accuracy and sensitivity are also much higher than those of the other methods.

5. Discussion

The size-insensitivity problem is common in objective function-based clustering methods such as FCM. FCM clusters data objects by minimizing the sum of squared Euclidean distances between data objects and cluster centers, which causes data objects at the edge of larger clusters to drag the centers of adjacent smaller clusters toward themselves. This movement often leads FCM to produce clusters with balanced sizes.

To solve this size-insensitivity problem, improved methods such as SIIB-FCM and KL-FCM introduce extra information, such as data populations and distribution characteristics, to correct the shortcomings caused by objective functions based on the Euclidean distance. Nevertheless, the only objective function that these methods attempt to optimize is still the sum of variance errors over all data points, which makes them sensitive to the distance between neighboring clusters. This simultaneous use of several kinds of information causes the different types of information to interfere with each other; thus, the algorithms cannot fully mine all the information carried by the data. The proposed method divides the optimization procedure into two stages: one for variance error optimization, and the other for extra-knowledge optimization. This segmented use of the objective functions helps the proposed algorithm better utilize the necessary information hidden in the data.

Additionally, attaching extra knowledge when clustering data objects may introduce unnecessary biases into the original data structure. For one thing, the a priori knowledge utilized by the algorithm is not groundless; it must be knowledge carried by the input data objects, and its use should not distort the hidden information in the dataset that the algorithm hopes to extract. In addition, the proposed algorithm has some tolerance to errors in the input cluster size information, which helps reduce the influence of such biases in application cases.

6. Conclusion

The size-insensitivity problem is one of the significant shortcomings of traditional FCM and its variations, resulting in weak clustering performance when dealing with unbalanced datasets. Thus, this paper proposes an improved fuzzy clustering method based on the assumption that the “cluster sizes” are available as a priori information and adds a size objective function to take advantage of the information lost in traditional FCM. The algorithm contains a two-stage adjustment: the first is a position adjustment to find the most suitable location for each cluster, and the second is a further adjustment that continues to optimize the size objective function by changing the cluster shape. Additionally, utilizing the Mahalanobis distance to define the moving directions during the adjustment procedure enhances the capacity of the algorithm to deal with multi-distribution datasets. Compared with other algorithms aimed at solving the same problem, such as KL-FCM and SIIB-FCM, the proposed method extracts the actual data distribution more correctly and has a high tolerance to cluster distance. Additionally, the proposed CSCD-FCM can offer a correct clustering solution even when the input information contains errors.

In this study, the proposed method is shown to be able to solve clustering problems with different data distributions to some degree. However, when the cluster shapes become concave, the algorithm performs unsatisfactorily. In the future, we would like to combine CSCD-FCM with fuzzy variation strength to deal with arbitrarily shaped clusters and increase its application range.

Fig. 1.

Overall flow of the algorithm.


Fig. 2.

Data distribution of the input dataset.


Fig. 3.

FCM results and peak and floor of cluster 2.


Fig. 4.

Density center of cluster 2.


Fig. 5.

Cluster position adjustment of cluster 2.


Fig. 6.

Cluster shape adjustment of cluster 2.


Fig. 7.

Final results.


Fig. 8.

Numerical dataset with two clusters and different data distributions.


Fig. 9.

Clustering results using the four algorithms: (a) FCM, (b) SIIB-FCM, (c) KL-FCM, and (d) CSCD-FCM.


Fig. 10.

Comparison of the accuracy of four algorithms with various distances.


Fig. 11.

Comparison of the F1_score of four algorithms with various distances.


Fig. 12.

Changes in indices as cluster size decreases.


Fig. 13.

Changes in indices as cluster size increases.


Table 1. Aims and contents of the four experiments.

Experiment      Contents
Experiment 1    Data structure extraction test
Experiment 2    Distance tolerance test
Experiment 3    Robustness test
Experiment 4    Practical example of a healthcare problem

Table 2. Cluster size results for two clusters with different distributions.

Method       Cluster 1 size   Cluster 1 difference   Cluster 2 size   Cluster 2 difference
FCM          1177±26          823±26                 923±26           −823±26
SIIB-FCM     807±105          1193±105               1293±105         −1193±105
KL-FCM       1582±13          418±13                 518±13           −418±13
CSCD-FCM     2000±1           0±1                    100±1            0±1

Table 3. Evaluation indices for two clusters with different distributions.

Method       Accuracy         F1_score         DI               XB
FCM          0.6080±0.01      0.6528±0.0033    0.0043±0.0016    0.1532±0.0032
SIIB-FCM     0.4321±0.05      0.6095±0.0119    0.0027±0.0013    0.0428±0.0031
KL-FCM       0.7171±0.13      0.6922±0.035     0.0016±0.0001    72.3254±58.074
CSCD-FCM     0.9998±0.0003    0.9991±0.0015    0.303±0.2565     0.1868±0.0086

Table 4. Evaluation indices for the practical example.

Method       Accuracy   F1_score   Sensitivity   Cp size
FCM          0.3885     0.3531     0.4608        2190 (−540)
SIIB-FCM     0.5785     0.5517     0.6150        2285 (−445)
KL-FCM       0.4861     0.5523     0.3795        1362 (−1368)
CSCD-FCM     0.7202     0.6552     0.8359        2934 (+204)

  1. Rokach, L, and Maimon, O (2005). Clustering methods. Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer, pp. 321-352 https://doi.org/10.1007/0-387-25465-X_15
  2. Maimon, O, and Rokach, L (2005). Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer
  3. Fraley, C, and Raftery, AE (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal. 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
  4. Nayak, J, Naik, B, and Behera, H (2015). Fuzzy C-means (FCM) clustering algorithm: a decade review from 2000 to 2014. Computational Intelligence in Data Mining. New Delhi, India: Springer, pp. 133-149 https://doi.org/10.1007/978-81-322-2208-8_14
  5. Banu, PN, and Andrews, S (2015). Performance analysis of hard and soft clustering approaches for gene expression data. International Journal of Rough Sets and Data Analysis. 2, 58-69. https://doi.org/10.4018/ijrsda.2015010104
  6. Bora, DJ, Gupta, D, and Kumar, A. (2014) . A comparative study between fuzzy clustering algorithm and hard clustering algorithm. Available https://arxiv.org/abs/1404.6059
  7. Bezdek, JC, Ehrlich, R, and Full, W (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences. 10, 191-203.
  8. Mohamed, NA, Ahmed, MN, and Farag, A. Modified fuzzy c-mean in medical image segmentation. Proceedings of 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat No 99CH36258), 1999, Phoenix, AZ, pp. 3429-3432. https://doi.org/10.1109/ICASSP.1999.757579
  9. Ma, Z, Tavares, JMR, Jorge, RN, and Mascarenhas, T (2010). A review of algorithms for medical image segmentation and their applications to the female pelvic cavity. Computer Methods in Biomechanics and Biomedical Engineering. 13, 235-246. https://doi.org/10.1080/10255840903131878
  10. Hsu, TH (2000). An application of fuzzy clustering in group-positioning analysis. Proceedings National Science Council of the Republic of China. 10, 157-167.
  11. Saxena, A, Prasad, M, Gupta, A, Bharill, N, Patel, OP, Tiwari, A, Joo, EM, Ding, W, and Lin, CT (2017). A review of clustering techniques and developments. Neurocomputing. 267, 664-681. https://doi.org/10.1016/j.neucom.2017.06.053
  12. Huang, SF, Lin, YH, Yih, JM, and Tseng, JC (2018). Fuzzy clustering algorithm based on mahalanobis distances with recursive process. International Journal of Intelligent Technologies & Applied Statistics. 11, 221-239. https://doi.org/10.6148/IJITAS.2017.1004.02
  13. Heidari, H, Moslehi, Z, Mirzaei, A, and Safayani, M (2018). Bayesian distance metric learning for discriminative fuzzy c-means clustering. Neurocomputing. 319, 21-33. https://doi.org/10.1016/j.neucom.2018.08.071
  14. Haldar, NAH, Khan, FA, Ali, A, and Abbas, H (2017). Arrhythmia classification using Mahalanobis distance based improved fuzzy C-means clustering for mobile health monitoring systems. Neurocomputing. 220, 221-235. https://doi.org/10.1016/j.neucom.2016.08.042
  15. Zhuang, X, Huang, Y, Palaniappan, K, and Zhao, Y (1996). Gaussian mixture density modeling, decomposition, and applications. IEEE Transactions on Image Processing. 5, 1293-1302. https://doi.org/10.1109/83.535841
  16. Ichihashi, H, Miyagishi, K, and Honda, K. Fuzzy c-means clustering with regularization by KL information. Proceedings of the 10th IEEE International Conference on Fuzzy Systems (Cat No 01CH37297), 2001, Melbourne, Australia, pp. 924-927. https://doi.org/10.1109/FUZZ.2001.1009107
  17. Dempster, AP, Laird, NM, and Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 39, 1-22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
  18. McLachlan, GJ, and Krishnan, T (1997). The EM Algorithm and Extensions. New York, NY: John Wiley & Sons
  19. Gath, I, and Geva, AB (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 11, 773-780. https://doi.org/10.1109/34.192473
  20. Bensaid, AM, Hall, LO, Bezdek, JC, and Clarke, LP (1996). Partially supervised clustering for image segmentation. Pattern Recognition. 29, 859-871. https://doi.org/10.1016/0031-3203(95)00120-4
  21. Noordam, JC, Van Den Broek, WHAM, and Buydens, LMC (2002). Multivariate image segmentation with cluster size insensitive fuzzy C-means. Chemometrics and Intelligent Laboratory Systems. 64, 65-78. https://doi.org/10.1016/S0169-7439(02)00052-7
  22. Lin, PL, Huang, PW, Kuo, CH, and Lai, YH (2014). A size-insensitive integrity-based fuzzy c-means method for data clustering. Pattern Recognition. 47, 2042-2056. https://doi.org/10.1016/j.patcog.2013.11.031
  23. Irani, J, Pise, N, and Phatak, M (2016). Clustering techniques and the similarity measures used in clustering: a survey. International Journal of Computer Applications. 134, 9-14. https://doi.org/10.5120/ijca2016907841
  24. Jonietz, D, Bucher, D, Martin, H, and Raubal, M (2018). Identifying and interpreting clusters of persons with similar mobility behaviour change processes. Geographic Technologies for All. Cham, Switzerland: Springer, pp. 291-307 https://doi.org/10.1007/978-3-319-78208-9_15
  25. Li, J, Horiguchi, Y, and Sawaragi, T. Refining fuzzy c-means membership functions to assimilate a priori knowledge of cluster sizes. Proceedings of 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), 2018, Toyama, Japan, pp. 654-659. https://doi.org/10.1109/SCIS-ISIS.2018.00110
  26. Li, J, Horiguchi, Y, and Sawaragi, T . A wrapper algorithm for fuzzy cluster membership modification based on size constraints., Proceedings of the 20th International Symposium on Advanced Intelligent Systems and 2019 International Conference on Biometrics and Kansei Engineering, 2019, Jeju, South Korea, pp.122-130.
  27. Heinzer, R, Vat, S, Marques-Vidal, P, Marti-Soler, H, Andries, D, and Tobback, N (2015). Prevalence of sleep-disordered breathing in the general population: the HypnoLaus study. The Lancet Respiratory Medicine. 3, 310-318. https://doi.org/10.1016/S2213-2600(15)00043-0
  28. Chung, F, Abdullah, HR, and Liao, P (2016). STOP-Bang questionnaire: a practical approach to screen for obstructive sleep apnea. Chest. 149, 631-638. https://doi.org/10.1378/chest.15-0903
  29. National Sleep Research Resource. (2001) . Sleep Heart Health Study dataset. Available https://sleepdata.org/datasets/shhs

Jiarui Li received her B.S. and M.S. degrees from Beijing Jiaotong University, China, in 2014 and 2017, respectively. She is currently a Ph.D. student at Kyoto University. Her research interests include fuzzy logic, human factors, and human-machine interference.

E-mail: ljr10225008@gmail.com


Yukio Horiguchi received his B.S., M.S., and Ph.D. degrees from Kyoto University, Japan. He is currently a Professor at the Faculty of Informatics, Kansai University. His research interests include human factors, human-machine interference, and interface designs. His personal homepage is http://www.syn.me.kyoto-u.ac.jp/horiguchi/.


Tetsuo Sawaragi received his B.S., M.S., and Ph.D. degrees from Kyoto University, Japan. He is currently a Professor at the Department of Mechanical Engineering, Kyoto University. His research interests include system engineering, human-machine system, human-machine interference, and cognitive engineering. His personal homepage is http://www.design.kyoto-u.ac.jp/faculty/t-sawaragi.html.


Article

Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(4): 346-357

Published online December 25, 2020 https://doi.org/10.5391/IJFIS.2020.20.4.346

Copyright © The Korean Institute of Intelligent Systems.

Cluster Size-Constrained Fuzzy C-Means with Density Center Searching

Jiarui Li1, Yukio Horiguchi2, and Tetsuo Sawaragi1

1Department of Mechanical Engineering and Science, Graduate School of Engineering, Kyoto University, Kyoto, Japan
2Faculty of Informatics, Kansai University, Osaka, Japan

Correspondence to:Jiarui Li (ljr10225008@gmail.com)

Received: June 18, 2020; Revised: December 7, 2020; Accepted: December 15, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Fuzzy C-means (FCM) has a definite limitation when partitioning a dataset into clusters with varying sizes and densities because it ignores the scale difference in different dimensions of input data objects. To alleviate this cluster size insensitivity, we propose a wrapper algorithm for FCM by introducing cluster size as a priori information and limiting the search direction on the basis of density benchmarks (CSCD-FCM). This method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage changes the shape of each cluster while maintaining its center. Both steps modify fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function. Numerical and practical experiments with unbalanced cluster size settings demonstrate the effectiveness of this method for extracting actual cluster structures, as well as achieving the desired cluster populations.

Keywords: Fuzzy C-means, Clustering, Cluster size insensitivity

1. Introduction

Clustering is a method for grouping data objects into clusters, such that objects in the same cluster have similar characteristics, whereas those in different clusters have disparate characteristics [1]. However, there is no absolute rule for defining cluster characteristics, which is why different clustering techniques have been proposed on the basis of various inclusion principles [2]. According to Fraley and Raftery [3], clustering methods can be categorized as hard clustering or soft (fuzzy) clustering. Hard clustering methods, such as K-means, strictly allocate each object to only one cluster, so these methods require clear cluster boundaries. However, in real life, data objects may naturally belong to either single or multiple groups, which limits the application of hard clustering [4]. Therefore, fuzzy clustering techniques have been developed [5]. Fuzzy clustering is more flexible because it allows every object to belong to more than one cluster [6]. Fuzzy C-means (FCM) is a widely used partitioning-based fuzzy clustering method that divides a set of data objects into fuzzy partitions by minimizing the intra-cluster variance and maximizing the inter-cluster variance in accordance with an objective function [7].

Because of the flexibility involved with acting on overlapped datasets, FCM and its expansions have been widely used in diverse fields such as image processing and market segmentation [810].

However, a common failure of partition-based clustering methods is their inability to deal with datasets of varying cluster size and data density, which we call the “size-insensitivity problem.” As the most commonly used partition-based clustering method, FCM is undoubtedly not immune to this issue [11].

For one thing, as a Euclidean distance-based method, FCM ignores the scale difference in different dimensions of input data objects, resulting in weak discernment of diverse-distribution data structures. To resolve this defect, Mahalanobis distance has been added to the original FCM by many researchers [1214]. Using the covariance matrix to evaluate the correlations among different dimensions makes such FCM extensions better at extracting data structures with various distributions. However, they sometimes fail when the difference in the distribution is small. In a similar way, Gaussian-based models are also good at distinguishing the distinctions of data distributions [15]. Ichihashi et al. [16] found that when setting a fuzzifier at a particular value, FCM achieves the same effect as Gaussian mixture density decomposition (GMDD) by regularizing the K-L information (KFCM) [17, 18]. KFCM is a partition-based improvement that solves the size-insensitive problem to some degree, but it is limited by the boundary conditions and cannot deal with clusters firmly located next to each other.

Moreover, because FCM uses a sum-of-squared-error objective function to obtain solutions, it tends to drift centers of smaller clusters toward more massive adjacent clusters and to equalize cluster populations. To overcome this problem, the following algorithms have been proposed: fuzzy maximum likelihood estimation (FMLE) [19], semi-supervised FCM (ss-FCM) [20], cluster-size-insensitive FCM (CSI-FCM) [21], and size-insensitive integrity-based FCM (SIIB-FCM) [22]. Lin et al. [22] showed that SIIB-FCM is superior to the other three, so we will not repeat the details here. However, SIIB-FCM is quite sensitive to the distances among clusters, which means it is only good at handling datasets with relatively clear cluster boundaries. This shortcoming is apparent when dealing with practical issues, such as removing small amounts of noise mixed in with the raw signal.

Considering the weaknesses mentioned above, a more effective method is needed to solve the size-insensitivity problem of FCM. In an actual application, one can sometimes obtain background information concerning data objects, such as the expected number of data groups and the approximate size of each group, from preliminary surveys, prior studies, or the opinions of domain experts in actual applications of data clustering. The proportion of each cluster size, as well as the number of clusters, may well be available as a priori knowledge for clustering. Clustering is based on a similarity measure to group semblable data objects together, so it is commonly utilized in areas such as market segmentation, vehicle routing selection, and healthcare problems [23, 24]. On these occasions, users employ certain conditions for various classification purposes. As a result, if such additional information is introduced as a priori information, it will likely help clustering algorithms find the intrinsic structure behind observations.

In our previous studies [25, 26], we developed an algorithm for FCM to assimilate information regarding the desired cluster sizes by refining the cluster membership functions. This method has demonstrated advantages over SIIB-FCM. In this paper, we extend the wrapper algorithm by defining more convincing adjustment directions by introducing the Mahalanobis distance.

The significant contributions of this study are as follows. 1) The present paper proposes a new, improved fuzzy clustering method to solve the size-insensitivity problem. 2) The proposed method is good at extracting data structures, regardless of the cluster numbers or data dimensions, and can deal with multi-distribution datasets. 3) Unlike other similar algorithms, the proposed CSCD-FCM is not sensitive to cluster distance, which widens its application range. 4) The proposed method can correct the error input of the priority information itself, so its application is soft and flexible.

The remainder of this paper is organized as follows. In Section 2, we review the FCM. Section 3 proposes a new algorithm to assimilate a priori knowledge of the cluster size proportions into fuzzy clustering. Section 4 presents experiments to show the advantages of the proposed method compared with other algorithms, and Section 5 discusses the results. Finally, concluding remarks are presented in Section 6.

2. An Overview of Fuzzy C-Means

Given a dataset that contains n objects, X = {x1, x2, · · ·, xn}, the FCM algorithm attempts to minimize the following objective function:

J=j=1ni=1cuijmxj-vi||2,

where c is the number of clusters, vi is the centroid of the ith cluster, uij is the degree of membership of xj in the ith cluster, and m is the fuzzifier exponent (mR and m > 1). The fuzzifier m results in fuzzier clusters with smaller values. For every cluster, boundary conditions include i=1cuij=1 and uij ∈ [0, 1]. A solution of objective function (1) can be obtained by an iterative process that alternately updates the memberships and the cluster centers. The membership uij is obtained from a given set of cluster centers by

uij=1/k=1c(xj-vixj-vk)2m-1.

In contrast, the cluster center vi is calculated from a given set of memberships by

vi=j=1nxj*uijm/j=1nuijm.

As mentioned in Section 1, FCM attempts to minimize the objective function (1) and thus tends to give data objects a lower membership grade for clusters with higher density. As a result, smaller clusters may acquire more objects than much larger clusters, which leads to balanced sizes and may lead to misclassification of data objects. SIIB-FCM aims to solve this problem.

3. Cluster Size-Constrained Fuzzy C-Means with Density Center Searching (CSCD-FCM)

3.1 “Size-Constrained” Objective Function

The proposed method aims to adjust multi-dimensional fuzzy clusters to attain the desired population sizes. We assume that the desired proportion of cluster sizes is available as a priori knowledge for the clustering problem in question. In this premise, it is hypothesized that as the calculated cluster size approaches the given cluster population, the data structure that will be extracted is more suitable. To this end, the proposed algorithm modifies the fuzzy partitions generated by FCM-like soft clustering methods by optimizing a “size-constrained” objective function, as shown in function (4).

Jsi=|j=1nuij-Si|.

Here, Si represents the target size of cluster ci, which is constrained by i=1cSi=n, where n is the total population of data. uij is the fuzzy membership of the data object xj in ci, and j=1nuij gives the calculated size of the fuzzy cluster ci. Jsi evaluates the difference in the generated cluster population from the given cluster size Si for the ith cluster.

The method is divided into two stages. The first stage adjusts the position of each cluster while maintaining its shape, and the second stage adjusts the shape of each cluster while maintaining its center position. For a clear understanding, we present the specific procedures of the adjustment in the next section.

3.2 Procedure of CSCD-FCM

As mentioned above, the proposed method uses two-stage adjustments intended to be applied to fuzzy partitions generated by other soft clustering methods. Both stages aim to optimize function (4). Figure 1 shows the entire flow of the algorithm.

After initializing the clustering solution, the adjustment procedure for each cluster is followed, and steps 2 to 4 are repeated. The order should be from the largest cluster to the smallest one. After completing the adjustment for one cluster, the data belonging to the adjusted cluster from the input that obtained the highest membership moves to the target cluster and the remainder of the dataset moves to step 2 as the new input for finding the density center of the next cluster. The memberships used for removing data come from the new membership matrix of the adjusted cluster.

For a more fundamental understanding, we use the following example to expound.

As shown in Figure 2, the input dataset contains three clusters with discrepant data populations of 100/1500/50. The following sections describe in detail the steps mentioned above using this example.

3.2.1 Cluster prototype generation

Given a dataset that consists of n objects with p dimensions X = {x1, x2, · · ·, xn}, soft clustering methods can generate fuzzy partitions of the dataset. In the case of the FCM algorithm, a p × c matrix

V=(v1v2vc)=(v11v21vc1v12v22vc2v1pv2pvcp)

expresses the cluster prototypes, where c is the number of clusters and vi is a p-dimensional vector that represents the centroid of the ith cluster ci. With the cluster center matrix V, we can calculate the membership degree of each object xj in each cluster ci using function (2). In this paper, we assume that m is two when no domain knowledge is available.

In addition, according to function (2), every cluster ci has the highest value of membership, that is, uij= 1, at its center, but the lowest value, that is, uij= 0, at the centers of the other clusters. For convenience while explaining the following processes, we name these positions the cluster peak and floor, respectively, for each cluster. As shown in Figure 3, FCM classifies data points into three clusters with similar populations, and v1, v2, v3 makes up the cluster center matrix. Using cluster 2 as an example, the position of v2 is the peak of cluster 2, while the positions of v1 and v3 are floors, which are the inputs to the following steps.

The subsequent stages of the proposed method regard the cluster center matrix V generated by FCM-like clustering algorithms as the initial solution.

3.2.2 Density prototypes

The “cluster center” is the arithmetic mean of all the points belonging to the cluster. Each point is closer to its cluster center than to the other cluster centers. The original FCM uses Euclidean distance as the only measurement standard for the objective function, causing it to drive smaller cluster centers closer to the larger ones. This is why an equal tumble occurs. As shown in Figure 4, it is obvious that point x is closer to the center of cluster 3 than cluster 2. Thus, FCM makes an erroneous judgment on data point x by placing it in cluster 2.

The limited consideration of only one standard leads to a wrong answer for clustering. To determine the most suitable positions of cluster centers, we introduce the cluster size Si as another reference index. Under the constraints of the given “prior knowledge” we define a new concept, the density prototype pi for each cluster.

The density reflects the closeness of the data near each other in a certain area, and it can be calculated as Sizei/Volumei, which expresses the number of data per unit volume. The cluster size Si is known knowledge, so as the value of Volumei decreases, the density increases. We use the sum of distances between each data point with its Si closest neighbor points as Volumnei. Furthermore, the density center pi should be at the densest position of the ith cluster. In other words, the Volumnei center on pi should be the smallest. The density prototype of cluster i is generated by the following steps.

First, a Mahalanobis distance matrix is created using all input objects

D=(dist1(k)dist2(k)distj(k))=(0d21dj1d120dj2d1jd2j0),         (k=1,2,,j).

In function (6), each j-dimension vector distj presents the distance between each data object xj and all the other objects. For a natural expression, djj, the distance to itself, is also presented in the matrix, which has a value of 0. The dkj Mahalanobis distances are shown in function (7).

dkj=(xk-xj)TS-1(xk-xj),

where S is the covariance matrix of xk and xj.

For each distance vector distj, we find the number of Si smallest components that have xjSi neighbor data objects nearest to it. Then, we make a sum of these Si components of distj and record it as Volumnej. Furthermore, we determine the density prototypes pi = xp for the ith cluster by seeking the data object xp that has the minimum value of Volumnej, j = 1, 2, . . ., n. As mentioned above, pi is considered the density prototype because compared with other data objects, the Si data objects around it are always the most compact, regardless of the numerical value or distribution features. pi is an appropriate symbol for defining the search direction.

Here, what should be given attention is that the position of density prototype pi is always different from the cluster center. The density prototype pi is used as a benchmark in the next two steps, which guide the cluster centers toward a high-density area. Hence, the cluster center appears on the line of the moving direction but may not coincide with the density prototype. As shown in Figure 4, the cross marks present some typical data points in cluster 2. We assume that all other data points are distributed uniformly. The result of the “cluster center” position of cluster 2 given by the proposed method will not be p2 partial to data point x.

3.2.3 Cluster position adjustment

The first stage of adjustment is intended to determine the desired position of each cluster to obtain the target cluster sizes. In this stage, the algorithm explores new positions for both the peak and floor points of each cluster under constraints that limit the search directions.

We first duplicate the cluster center matrix V for each cluster. Let us define cluster ci's duplicate of the matrix V as V(i). The ith column of V(i) represents the peak of cluster ci, and the other columns represent its floor points. Meanwhile, the search direction of cluster ci is constrained to point from the peak toward pi. During the search, the cluster peak and floor positions translate together in the same direction. The distance between the peak and pi determines the search interval. In our experience, the search distance should be more than 1.5 times the peak-to-pi distance to obtain a sufficiently broad search range; the appropriate value depends on the fuzziness of the dataset, with fuzzier data requiring a smaller search distance. However, the multiple does not need to exceed 3, which would only waste calculation time. If no domain knowledge is available, a multiple of 2 is recommended. Figure 5 shows the adjustment procedure for cluster 2 in this step.

In Figure 5, the adjustment direction of cluster 2 points from v2 toward p2, and the search distance is double the distance between v2 and p2. The peak and floors are moved together in the same direction over this distance.

Once the search direction and interval are specified, the algorithm attempts to minimize objective function (4) for the chosen cluster. The search updates the peak/floor matrix V(i) of that cluster. As shown in Figure 5, the black points are the results of this procedure for cluster 2.
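A minimal sketch of this position-adjustment search is given below. Because objective function (4) is not reproduced here, it is passed in as a callable size_objective; the step count, the column-wise layout of V(i), and the function name are illustrative assumptions.

```python
import numpy as np

def adjust_cluster_position(V_i, i, p_i, size_objective, n_steps=50, multiple=2.0):
    """Stage 1: translate the whole peak/floor matrix V(i) of cluster i.

    A sketch: the peak (column i of V_i, shape d x c) and all floor points
    move together from the peak toward the density prototype p_i, over an
    interval `multiple` times the peak-to-p_i distance. `size_objective`
    stands for the size-constrained objective function (4), assumed to
    return a value to be minimized.
    """
    peak = V_i[:, i]
    direction = p_i - peak
    dist = np.linalg.norm(direction)
    if dist == 0.0:
        return V_i
    direction = direction / dist
    best_V, best_score = V_i, size_objective(V_i)
    for step in np.linspace(0.0, multiple * dist, n_steps):
        candidate = V_i + step * direction[:, None]   # translate peak and floors together
        score = size_objective(candidate)
        if score < best_score:
            best_V, best_score = candidate, score
    return best_V
```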

3.2.4 Cluster shape adjustment

The second stage of adjusting the membership functions changes the shape of each cluster to minimize objective function (4) by moving the clusters’ floor positions while fixing their peak positions.

Given the peak/floor matrices V(i) obtained from the first stage, the algorithm searches for a better floor-point placement for each cluster ci. During the search, all the floor points of the target cluster are translated together. The search direction and interval are the same as in the first stage.

As shown in Figure 6, during the shape adjustment procedure, the peak v2(2) of cluster 2 is kept fixed, and the floors v1(2) and v3(2) are adjusted in the same direction as in the position adjustment step.
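Under the same assumptions as the previous sketch, the shape-adjustment stage differs only in that the peak column is held fixed while the floor columns are translated:

```python
import numpy as np

def adjust_cluster_shape(V_i, i, p_i, size_objective, n_steps=50, multiple=2.0):
    """Stage 2: move only the floor points of cluster i, keeping its peak fixed.

    Same search direction and interval as in the position-adjustment stage;
    `size_objective` again stands for objective function (4).
    """
    peak = V_i[:, i]
    direction = p_i - peak
    dist = np.linalg.norm(direction)
    if dist == 0.0:
        return V_i
    direction = direction / dist
    floors = np.arange(V_i.shape[1]) != i             # boolean mask of floor columns
    best_V, best_score = V_i, size_objective(V_i)
    for step in np.linspace(0.0, multiple * dist, n_steps):
        candidate = V_i.copy()
        candidate[:, floors] += step * direction[:, None]  # translate floors only
        score = size_objective(candidate)
        if score < best_score:
            best_V, best_score = candidate, score
    return best_V
```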

3.2.5 Generate new membership functions

The result of the two-stage adjustment is a set of updated peak/floor matrices V(i) for all clusters. By applying function (2), we obtain a membership matrix U(i) = (uij), 1 ≤ i ≤ c, 1 ≤ j ≤ n, from each peak/floor matrix V(i). The proposed method then creates a net membership matrix by stacking the ith row of each U(i), as shown below:

$$U = \begin{pmatrix} U^{(1)}_{1} \\ U^{(2)}_{2} \\ \vdots \\ U^{(c)}_{c} \end{pmatrix} = \begin{pmatrix} u^{(1)}_{11} & u^{(1)}_{12} & \cdots & u^{(1)}_{1n} \\ u^{(2)}_{21} & u^{(2)}_{22} & \cdots & u^{(2)}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u^{(c)}_{c1} & u^{(c)}_{c2} & \cdots & u^{(c)}_{cn} \end{pmatrix}.$$

At the same time, we combine the peak vectors of each V(i) as the new cluster center matrix:

$$V_{\mathrm{new}} = \begin{pmatrix} v^{(1)}_{1} & v^{(2)}_{2} & \cdots & v^{(c)}_{c} \end{pmatrix}.$$

Figure 7 shows the resulting V_new and the classification obtained by the proposed algorithm for the example dataset.
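A sketch of this assembly step is given below, reusing the fcm_memberships helper from the earlier sketch; the list V_list holding the adjusted matrices V(i) is an illustrative name, not part of the original implementation.

```python
import numpy as np

def assemble_results(V_list, X, m=2.0):
    """Assemble the net membership matrix U and the new center matrix V_new.

    A sketch assuming V_list[i] is the adjusted peak/floor matrix V(i)
    (d x c, columns are centers) and fcm_memberships is the membership
    update shown earlier (function (2)).
    """
    c = len(V_list)
    # row i of U comes from the membership matrix computed with V(i)
    U = np.vstack([fcm_memberships(X, V_list[i].T, m)[i] for i in range(c)])
    # the new center of cluster i is the peak (column i) of V(i)
    V_new = np.column_stack([V_list[i][:, i] for i in range(c)])
    return U, V_new
```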

4. Experiment

We use numerical and practical experiments to examine the effectiveness of the proposed method in comparison with FCM, SIIB-FCM, and KL-FCM. To test the strengths of the proposed algorithm, we conducted four experiments. The aims and contents of the four experiments are shown in Table 1.

4.1 Capacity for Correctly Extracting Data Structures

In this experiment, a numerical dataset is prepared to test the clustering correctness of the proposed method by evaluating the error in the data population classified into each cluster and the accuracy of the clustering results. Additionally, for the numerical dataset, we also evaluated two indices typically used for soft clustering methods: Dunn's index (DI) and the Xie-Beni index (XB).
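For reference, the Xie-Beni index can be computed from a fuzzy partition as in the following sketch, which uses the common definition (lower values indicate more compact and better-separated clusters); the fuzzifier m = 2 is an assumption, not a value taken from the paper.

```python
import numpy as np

def xie_beni_index(X, U, V, m=2.0):
    """Xie-Beni validity index for a fuzzy partition (lower is better).

    A sketch using the common definition
    XB = sum_i sum_j u_ij^m ||x_j - v_i||^2 / (n * min_{i != k} ||v_i - v_k||^2),
    with U of shape (c, n), V of shape (c, d), and X of shape (n, d).
    """
    n = X.shape[0]
    d2 = np.linalg.norm(V[:, None, :] - X[None, :, :], axis=2) ** 2   # (c, n)
    compactness = np.sum((U ** m) * d2)
    center_d2 = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2
    np.fill_diagonal(center_d2, np.inf)                               # ignore i == k
    return compactness / (n * center_d2.min())
```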

The experiment used artificial datasets with different cluster populations. Section 3 uses a three-cluster dataset as an example to explain the algorithm. Here, we created another dataset to test the effectiveness of the proposed algorithm under different data distributions. Figure 8 shows the original distributions of the input dataset.

For this test, three cross-test datasets were prepared, and the cluster sizes were set to 2000/50. Figure 9 shows the results.

KL-FCM, a segment-based clustering method using the Mahalanobis distance, shows its advantages when the input data have diverse distributions. The proposed algorithm also draws on the advantage of the Mahalanobis distance in the stage of finding the density center; thus, CSCD-FCM is tolerant of multi-distribution data to some degree.

Tables 2 and 3 show the related evaluation indices.

Here, SIIB-FCM shows the lowest XB value, which results primarily from the reduction of the memberships of all the data. However, compared with the accuracy of extracting the exact data structure, this decrease in the XB index is not meaningful. Considering the size-insensitivity problem, CSCD-FCM performs best here.

4.2 Clustering Capacity with Varying Distances among Clusters

SIIB-FCM and KL-FCM were proposed to improve FCM performance on datasets with unbalanced cluster sizes and different data distributions. Both methods perform well when clusters are sufficiently far apart, but not when clusters are close together; we call this the “distance-sensitivity problem.” This problem commonly occurs in unsupervised clustering methods, especially when the cluster sizes are unbalanced. The following experiments focus on the distance-sensitivity problem, and the results show that the proposed method has a high tolerance to the compactness of clusters.

In this experiment, datasets were generated with two circular clusters of normally distributed data, with cluster sizes of 2000/50 and various cluster distances, and a three-dataset validation was conducted for each type of input. The distance between the circle centers measures the distance between the clusters.

We used the accuracy and F1_score to evaluate the clustering performance of the four algorithms as the cluster distance varies from close to far. Figures 10 and 11 show the accuracy and F1_scores, respectively, of FCM, SIIB-FCM, KL-FCM, and CSCD-FCM with various cluster distances.

The accuracy and F1_score reflect the degree of correct classification. Figures 10 and 11 show that the size-insensitivity problem exists in FCM, regardless of the distances among the clusters. When the interval between the two clusters is sufficiently large, e.g., over 4000, as shown in this example, KL-FCM and SIIB-FCM can obtain better results than FCM. However, when the clusters are closer together, KL-FCM and SIIB-FCM sometimes perform worse than the original FCM.

In contrast, the proposed algorithm shows almost no sensitivity to distance: even when the clusters are packed tightly together, the clustering result remains good. This strength may make CSCD-FCM more widely applicable than similar algorithms.

4.3 Robustness of the Algorithm

This experiment tests whether the proposed algorithm can obtain adequate results when the cluster size information contains errors. The datasets used in this experiment were again generated as two circular clusters of normally distributed data with cluster sizes of 2000/50. We evaluated the accuracy and F1_score by fixing one cluster's size at its correct value and reducing or increasing the other cluster's size by a certain percentage of its actual value. Figures 12 and 13 show the results.

The results show that the proposed algorithm is robust to errors in the input size information. The algorithm still achieves both accuracy and F1_score greater than 0.95 when the given information reduces the larger cluster to 70% of its actual size or increases the smaller cluster to 800% of its actual size. Thus, the proposed algorithm has a degree of tolerance to errors in the prior information. Consequently, when the algorithm is applied to a practical problem, the requirements on the prior knowledge of cluster sizes are not strict; a certain amount of unavoidable, ambiguous error is permitted.

4.4 A Practical Example

In this experiment, we conducted a practical application to a healthcare problem. Obstructive sleep apnea (OSA) is a common sleep disorder, especially among elderly individuals [27]. According to the most recent survey from the American Academy of Sleep Medicine (AASM), the sleep health research authority, 71.9% of the general population aged 40–85 years suffers from this disease [27]. Because the events occur while people are asleep, OSA is difficult to detect at the primary care stage. A commonly used means of self-diagnosing OSA is a screening tool, the most well-known of which is the STOP-Bang questionnaire [28].

There are eight questions in the STOP-Bang questionnaire. We collected the STOP-Bang data and Apnea-Hypopnea Index (AHI) of 3,931 people from the Sleep Heart Health Study dataset [29], where the AHI is regarded as the gold-standard indicator for diagnosing OSA. The participants' ages ranged from 39 to 95 years. None of these 3,931 people had been diagnosed with OSA by a medical doctor or had received OSA treatment before the data were collected. An AHI ≥ 5 is the critical value for judging OSA, and 2,730 objects, approximately 70% of the 3,931 people, should be diagnosed with OSA. We applied the algorithm to the 8-dimensional input data reflecting the eight questions for the 3,931 objects, with the prior knowledge of 2,730 objects for the positive cluster and 1,201 objects for the negative cluster. The clustering results were evaluated against the AHI-based judgments. For a healthcare problem, medical doctors are generally more concerned about sensitivity (the true positive rate); hence, this index is introduced as another evaluation indicator, in addition to accuracy and F1_score. The clustering results of the four methods are shown in Table 4.
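For clarity, the three indicators can be computed from binary (positive/negative) cluster assignments as in the sketch below; this is a generic illustration of the metrics, not the authors' evaluation code.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, F1_score, and sensitivity for binary labels (1 = OSA positive).

    A generic sketch of the indicators used above; y_true comes from the
    AHI >= 5 criterion and y_pred from the cluster assignment.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)                    # true positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity
```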

Cp in Table 4 denotes the positive cluster, and Cp size is the data population classified into the positive cluster (with the difference from the true cluster size of 2,730 in parentheses). Table 4 clearly shows that, under the constraint of the prior knowledge, the proposed method yields the smallest difference between the classified data populations and the given sizes, and its accuracy and sensitivity are also much higher than those of the other methods.

5. Discussion

The size-insensitivity problem is common in objective function-based clustering methods such as FCM. FCM clusters data objects by minimizing the sum of Euclidean distance errors among the data, which causes data objects at the edge of larger clusters to drag the centers of adjacent smaller clusters toward themselves. This movement is why the clustering results of FCM often have artificially balanced sizes.

To solve this size-insensitivity problem, improved methods such as SIIB-FCM and KL-FCM introduce extra information, such as data populations and distribution characteristics, to correct the shortcomings caused by objective functions based on the Euclidean distance. Nevertheless, the only objective function that these methods attempt to optimize is still the sum of the variance errors of all data points, which makes them sensitive to the distance between neighboring clusters. This simultaneous use of several kinds of information causes the different types of information to interfere with each other; thus, these algorithms cannot fully mine all the information carried by the data. The proposed method instead divides the optimization procedure into two stages: one for variance-error optimization and the other for extra-knowledge optimization. This segmented use of the objective functions helps the proposed algorithm better utilize the necessary information hidden in the data.

Additionally, attaching extra knowledge when clustering data objects may introduce unnecessary biases into the original data structure. The a priori knowledge utilized by the algorithm must therefore not be arbitrary: it must be knowledge actually carried by the input data objects, and its use should not distort the hidden information in the dataset that the algorithm hopes to extract. Moreover, the proposed algorithm has some tolerance to errors in the input cluster size information, which helps reduce the influence of such biases in application cases.

6. Conclusion

The size-insensitivity problem is one of the significant shortcomings of traditional FCM and its variants, resulting in weak clustering performance on unbalanced datasets. This paper therefore proposes an improved fuzzy clustering method based on the assumption that the cluster sizes are available as a priori information, and it adds a size objective function to exploit the information lost by traditional FCM. The algorithm contains a two-stage adjustment: a position adjustment that finds the most suitable location for each cluster, and a further adjustment that continues to optimize the size objective function by changing the cluster shape. Additionally, using the Mahalanobis distance to define the moving directions during the adjustment procedure enhances the capacity of the algorithm to deal with multi-distribution datasets. Compared with other algorithms aimed at the same problem, such as KL-FCM and SIIB-FCM, the proposed method extracts the actual data distribution more correctly and has a high tolerance to the distance between clusters. Additionally, the proposed CSCD-FCM offers a correct clustering solution even when the input size information contains errors.

In this study, the proposed method is shown to be able to solve clustering problems with different data distributions to some degree. However, when the cluster shapes become concave, the algorithm performs unsatisfactorily. In the future, we would like to combine CSCD-FCM with fuzzy variation strength to deal with arbitrarily shaped clusters and thereby increase its application range.

Figure 1. Overall flow of the algorithm.

Figure 2. Data distribution of the input dataset.

Figure 3. FCM results and peak and floor of cluster 2.

Figure 4. Density center of cluster 2.

Figure 5. Cluster position adjustment of cluster 2.

Figure 6. Cluster shape adjustment of cluster 2.

Figure 7. Final results.

Figure 8. Numerical dataset with two clusters and different data distributions.

Figure 9. Clustering results using the four algorithms: (a) FCM, (b) SIIB-FCM, (c) KL-FCM, and (d) CSCD-FCM.

Figure 10. Comparison of the accuracy of the four algorithms with various distances.

Figure 11. Comparison of the F1_score of the four algorithms with various distances.

Figure 12. Changes in indices as cluster size decreases.

Figure 13. Changes in indices as cluster size increases.

Table 1. Aims and contents of the four experiments.

Experiment   | Contents
Experiment 1 | Data structure extraction test
Experiment 2 | Distance tolerance test
Experiment 3 | Robustness test
Experiment 4 | Practical example of a healthcare problem

Table 2. Cluster size results for two clusters with different distributions.

Method    | Cluster 1 size | Cluster 1 difference | Cluster 2 size | Cluster 2 difference
FCM       | 1177±26        | 823±26               | 923±26         | −823±26
SIIB-FCM  | 807±105        | 1193±105             | 1293±105       | −1193±105
KL-FCM    | 1582±13        | 418±13               | 518±13         | −418±13
CSCD-FCM  | 2000±1         | 0±1                  | 100±1          | 0±1

Table 3. Evaluation indices for two clusters with different distributions.

Method    | Accuracy       | F1_score       | DI             | XB
FCM       | 0.6080±0.01    | 0.6528±0.0033  | 0.0043±0.0016  | 0.1532±0.0032
SIIB-FCM  | 0.4321±0.05    | 0.6095±0.0119  | 0.0027±0.0013  | 0.0428±0.0031
KL-FCM    | 0.7171±0.13    | 0.6922±0.035   | 0.0016±0.0001  | 72.3254±58.074
CSCD-FCM  | 0.9998±0.0003  | 0.9991±0.0015  | 0.303±0.2565   | 0.1868±0.0086

Table 4. Evaluation indices for the practical example.

Method    | Accuracy | F1_score | Sensitivity | Cp size
FCM       | 0.3885   | 0.3531   | 0.4608      | 2190 (−540)
SIIB-FCM  | 0.5785   | 0.5517   | 0.6150      | 2285 (−445)
KL-FCM    | 0.4861   | 0.5523   | 0.3795      | 1362 (−1368)
CSCD-FCM  | 0.7202   | 0.6552   | 0.8359      | 2934 (+204)

References

  1. Rokach, L, and Maimon, O (2005). Clustering methods. Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer, pp. 321-352 https://doi.org/10.1007/0-387-25465-X_15
  2. Maimon, O, and Rokach, L (2005). Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer
  3. Fraley, C, and Raftery, AE (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal. 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
  4. Nayak, J, Naik, B, and Behera, H (2015). Fuzzy C-means (FCM) clustering algorithm: a decade review from 2000 to 2014. Computational Intelligence in Data Mining. New Delhi, India: Springer, pp. 133-149 https://doi.org/10.1007/978-81-322-2208-8_14
  5. Banu, PN, and Andrews, S (2015). Performance analysis of hard and soft clustering approaches for gene expression data. International Journal of Rough Sets and Data Analysis. 2, 58-69. https://doi.org/10.4018/ijrsda.2015010104
  6. Bora, DJ, Gupta, D, and Kumar, A. (2014) . A comparative study between fuzzy clustering algorithm and hard clustering algorithm. Available https://arxiv.org/abs/1404.6059
  7. Bezdek, JC, Ehrlich, R, and Full, W (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences. 10, 191-203.
  8. Mohamed, NA, Ahmed, MN, and Farag, A . Modified fuzzy c-mean in medical image segmentation., Proceedings of 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat No 99CH36258), 1999, Phoenix, AZ, Array, pp.3429-3432. https://doi.org/10.1109/ICASSP.1999.757579
  9. Ma, Z, Tavares, JMR, Jorge, RN, and Mascarenhas, T (2010). A review of algorithms for medical image segmentation and their applications to the female pelvic cavity. Computer Methods in Biomechanics and Biomedical Engineering. 13, 235-246. https://doi.org/10.1080/10255840903131878
  10. Hsu, TH (2000). An application of fuzzy clustering in group-positioning analysis. Proceedings National Science Council of the Republic of China. 10, 157-167.
  11. Saxena, A, Prasad, M, Gupta, A, Bharill, N, Patel, OP, Tiwari, A, Joo, EM, Ding, W, and Lin, CT (2017). A review of clustering techniques and developments. Neurocomputing. 267, 664-681. https://doi.org/10.1016/j.neucom.2017.06.053
  12. Huang, SF, Lin, YH, Yih, JM, and Tseng, JC (2018). Fuzzy clustering algorithm based on mahalanobis distances with recursive process. International Journal of Intelligent Technologies & Applied Statistics. 11, 221-239. https://doi.org/10.6148/IJITAS.2017.1004.02
  13. Heidari, H, Moslehi, Z, Mirzaei, A, and Safayani, M (2018). Bayesian distance metric learning for discriminative fuzzy c-means clustering. Neurocomputing. 319, 21-33. https://doi.org/10.1016/j.neucom.2018.08.071
  14. Haldar, NAH, Khan, FA, Ali, A, and Abbas, H (2017). Arrhythmia classification using Mahalanobis distance based improved fuzzy C-means clustering for mobile health monitoring systems. Neurocomputing. 220, 221-235. https://doi.org/10.1016/j.neucom.2016.08.042
  15. Zhuang, X, Huang, Y, Palaniappan, K, and Zhao, Y (1996). Gaussian mixture density modeling, decomposition, and applications. IEEE Transactions on Image Processing. 5, 1293-1302. https://doi.org/10.1109/83.535841
  16. Ichihashi, H, Miyagishi, K, and Honda, K . Fuzzy c-means clustering with regularization by KL information., Proceedings of the 10th IEEE International Conference on Fuzzy Systems (Cat No 01CH37297), 2001, Melbourne, Australia, Array, pp.924-927. https://doi.org/10.1109/FUZZ.2001.1009107
  17. Dempster, AP, Laird, NM, and Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 39, 1-22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
  18. McLachlan, GJ, and Krishnan, T (1997). The EM Algorithm and Extensions. New York, NY: John Wiley & Sons
  19. Gath, I, and Geva, AB (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 11, 773-780. https://doi.org/10.1109/34.192473
  20. Bensaid, AM, Hall, LO, Bezdek, JC, and Clarke, LP (1996). Partially supervised clustering for image segmentation. Pattern Recognition. 29, 859-871. https://doi.org/10.1016/0031-3203(95)00120-4
  21. Noordam, JC, Van Den Broek, WHAM, and Buydens, LMC (2002). Multivariate image segmentation with cluster size insensitive fuzzy C-means. Chemometrics and Intelligent Laboratory Systems. 64, 65-78. https://doi.org/10.1016/S0169-7439(02)00052-7
  22. Lin, PL, Huang, PW, Kuo, CH, and Lai, YH (2014). A size-insensitive integrity-based fuzzy c-means method for data clustering. Pattern Recognition. 47, 2042-2056. https://doi.org/10.1016/j.patcog.2013.11.031
  23. Irani, J, Pise, N, and Phatak, M (2016). Clustering techniques and the similarity measures used in clustering: a survey. International Journal of Computer Applications. 134, 9-14. https://doi.org/10.5120/ijca2016907841
  24. Jonietz, D, Bucher, D, Martin, H, and Raubal, M (2018). Identifying and interpreting clusters of persons with similar mobility behaviour change processes. Geographic Technologies for All. Cham, Switzerland: Springer, pp. 291-307 https://doi.org/10.1007/978-3-319-78208-9_15
  25. Li, J, Horiguchi, Y, and Sawaragi, T . Refining fuzzy c-means membership functions to assimilate a priori knowledge of cluster sizes., Proceedings of 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), 2018, Toyama, Japan, Array, pp.654-659. https://doi.org/10.1109/SCIS-ISIS.2018.00110
  26. Li, J, Horiguchi, Y, and Sawaragi, T . A wrapper algorithm for fuzzy cluster membership modification based on size constraints., Proceedings of the 20th International Symposium on Advanced Intelligent Systems and 2019 International Conference on Biometrics and Kansei Engineering, 2019, Jeju, South Korea, pp.122-130.
  27. Heinzer, R, Vat, S, Marques-Vidal, P, Marti-Soler, H, Andries, D, and Tobback, N (2015). Prevalence of sleep-disordered breathing in the general population: the HypnoLaus study. The Lancet Respiratory Medicine. 3, 310-318. https://doi.org/10.1016/S2213-2600(15)00043-0
  28. Chung, F, Abdullah, HR, and Liao, P (2016). STOP-Bang questionnaire: a practical approach to screen for obstructive sleep apnea. Chest. 149, 631-638. https://doi.org/10.1378/chest.15-0903
  29. National Sleep Research Resource. (2001) . Sleep Heart Health Study dataset. Available https://sleepdata.org/datasets/shhs
