International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(2): 156-167
Published online June 25, 2020
https://doi.org/10.5391/IJFIS.2020.20.2.156
© The Korean Institute of Intelligent Systems
Norah Ibrahim Fantoukh, Mohamed Maher Ben Ismail, and Ouiem Bchir
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Correspondence to: Mohamed Maher Ben Ismail and Ouiem Bchir (maher.benismail@gmail.com, ouiem.bchir@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Semi-supervised clustering relies on both labeled and unlabeled data to steer the clustering process towards an optimal categorization and escape from local minima. In this paper, we propose a novel fuzzy relational semi-supervised clustering algorithm based on an adaptive local distance measure (SSRF-CA). The proposed clustering algorithm utilizes side-information and formulates it as a set of constraints to supervise the learning task. These constraints are expressed using reward and penalty terms, which are integrated into a novel objective function. In particular, we formulate the clustering task as an optimization problem through the minimization of the proposed objective function. Solving this optimization problem provides the optimal values of the different objective function parameters and yields the proposed semi-supervised clustering algorithm. Along with its ability to perform data clustering and learn the underlying dissimilarity measure between the data instances, our algorithm determines the optimal number of clusters in an unsupervised manner. Moreover, the proposed SSRF-CA is designed to handle relational data. This makes it appropriate for applications where only pairwise similarity (or dissimilarity) information between data instances is available. In this paper, we demonstrate the ability of the proposed algorithm to learn the appropriate local distance measures and the optimal number of clusters while partitioning various synthetic and real-world benchmark datasets containing varying numbers of clusters with diverse shapes. The experimental results revealed that the proposed SSRF-CA achieved the best performance among the considered state-of-the-art algorithms, confirming the effectiveness of our clustering approach.
Keywords: Semi-supervised clustering, Relational data, Fuzzy clustering, Local distance measure learning, Optimal number of clusters
Clustering is one of the most popular unsupervised learning techniques, commonly used in the data mining and pattern recognition fields [1, 2]. The resulting categories consist of sets of homogeneous patterns [1]. Accordingly, data instances that belong to the same cluster are more similar to each other than to instances from other clusters. Clustering can be perceived as a data modeling technique that yields a concise data summarization. Recently, clustering approaches have gained attention because they play a key role in a broad range of applications. In fact, clustering has become relevant to various contexts and disciplines, including artificial intelligence, pattern recognition, information retrieval, image analysis, and bioinformatics [3].
Clustering is typically an NP-hard problem in which the unsupervised learning task uses no prior information on the clusters/categories or on the relationships between the data points [4]. Furthermore, the same dataset may require different categorizations depending on the application. Prior information can be integrated to steer the clustering process and yield significant performance improvements. Clustering algorithms that employ both unlabeled and labeled data as prior knowledge are called semi-supervised clustering algorithms. Recently, researchers from the data mining and pattern recognition fields have shown considerable interest in semi-supervised clustering [5, 6].
Another limitation of typical clustering algorithms is their inability to handle complex data using cluster prototypes. In fact, conventional clustering approaches that rely on cluster prototypes cannot be used with all datasets. In particular, relational data, which is commonly encountered in fields where representing individual data instances using low-level features is not possible, cannot be handled by such prototype-based clustering algorithms. A relational clustering technique [7] should be employed when the data is purely relational and only pairwise relations (similarities or dissimilarities) between points are known. However, relational clustering has received considerably less attention than prototype-based clustering approaches. This can be attributed, in part, to the fact that engineers and mathematicians most often deal with object data and only infrequently with purely relational data.
Moreover, hard/crisp clustering yields distinct clusters in which each data instance belongs exclusively to one cluster. However, real datasets usually include overlapping clusters. Consequently, it is more convenient to use fuzzy clustering [8, 9], which allows data points to simultaneously belong to multiple clusters with different degrees of membership.
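For reference, the standard fuzzy-partition constraint (the usual formulation in the fuzzy clustering literature [8, 9]; the paper's own notation did not survive extraction, so this is an assumption) lets each membership degree $u_{ij}$ of point $x_j$ in cluster $i$ vary continuously, while the memberships of each point sum to one over the $C$ clusters:

$$u_{ij} \in [0,1], \qquad \sum_{i=1}^{C} u_{ij} = 1, \quad j = 1, \dots, N.$$

Crisp clustering is recovered as the special case $u_{ij} \in \{0, 1\}$.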
Another major issue is the difficulty of specifying the optimal number of clusters. Various clustering techniques determine this parameter by repeating the clustering process with different numbers of clusters and then selecting the value that yields the best cluster validity measure [10]. Such an approach is impractical for large datasets owing to its expensive computational cost. Despite the considerable effort made by the research community to introduce more advanced and accurate clustering algorithms [3], determining the optimal number of clusters in an automatic manner remains a challenging problem that affects the performance of most state-of-the-art techniques.
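To make the cost of this strategy concrete, here is a minimal sketch of the repeat-and-validate baseline described above (purely illustrative, using scikit-learn's KMeans and silhouette score, neither of which the paper itself employs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_validity(X, k_range=range(2, 11)):
    """Re-cluster for every candidate k and keep the best validity score.
    This is the expensive baseline the text criticizes: one full
    clustering run per candidate number of clusters."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

Each candidate value of k requires a full clustering pass, which is exactly why this exhaustive search does not scale to large datasets.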
The main contribution of this work is the introduction of a novel semi-supervised fuzzy relational clustering algorithm, based on local distance learning, that automatically determines the optimal number of clusters. In addition, the supervision information used in this approach steers the clustering process towards the optimal partition and helps it escape local minima. Additionally, the proposed method is formulated to handle relational data. This makes it useful for real applications where only pairwise similarities (or dissimilarities) between data instances are available.
We dedicate this section to surveying the works most related to our proposed approach. We partition this survey into two main categories based on the technical foundations of the published works. The first part focuses on studies relevant to semi-supervised fuzzy clustering with measure learning. The second part covers several methods that can be adopted to determine the optimal number of clusters for a given dataset.
The research in [11] introduced the metric learning and pairwise constraints K-means (MPCK-means) algorithm, a semi-supervised clustering algorithm that unifies constraint-based and measure-based techniques. It is an extension of the well-known K-means algorithm [1] that employs both labeled and unlabeled data to guide the clustering and learn the distance measure simultaneously. However, this approach allows only a linear mapping of the input data. Therefore, it cannot separate clusters that are not linearly separable.
In [12], an adaptive semi-supervised clustering kernel method (SCKMM) was proposed: a kernel-based variant of K-means that incorporates pairwise constraints along with measure learning into a nonlinear framework. Although the experimental results on various datasets proved the effectiveness of SCKMM, it remains practical only for small datasets owing to the computational complexity of solving for the measure matrix.
In [13], a relational fuzzy semi-supervised clustering method with local distance measure learning (SURF-LDML) was introduced. The supervisory information is integrated to guide the clustering process towards the target categorization and away from local minima. The side-information is also exploited to learn the underlying distance measure while categorizing the data. Moreover, SURF-LDML is formulated to work with relational data. However, SURF-LDML may not be suitable when the distribution of the different clusters in the input space exhibits large variations.
More recently, a novel clustering algorithm named fuzzy clustering with learnable cluster-dependent kernels (FLeCK) was proposed [14]. This algorithm learns the local Gaussian kernel parameters while categorizing the data, and it is also designed to work with relational data. However, all the above state-of-the-art approaches assume that the number of clusters is given before the clustering process starts, based on experience or domain knowledge, which is a significant limitation in real-world clustering applications.
The competitive agglomeration (CA) algorithm [15] can learn the number of clusters in an unsupervised way. CA is an iterative process in which the number of clusters decreases gradually over the iterations: it starts the clustering process with the maximum number of clusters, and the final categorization represents the optimal number of clusters. The core limitation of the CA algorithm is that it is prototype-based, and thus not suitable for applications where the data points are not expressed as feature vectors. A possible way to overcome this drawback is to operate on relational data, which consists of the pairwise relations between each pair of points.
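The CA update equations are not reproduced in this text, but the agglomeration mechanism can be sketched as a simplified FCM-style loop that prunes clusters whose fuzzy cardinality drops below a threshold (all names are hypothetical, and the real CA additionally includes a competition term in the membership update):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-9):
    # Fuzzy C-means memberships from squared Euclidean distances.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def ca_like_clustering(X, c_max=10, min_cardinality=3.0, n_iter=50, m=2.0, seed=0):
    """Start with c_max clusters; prune those whose fuzzy cardinality
    (column sum of the membership matrix) falls below a threshold."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c_max, replace=False)]
    for _ in range(n_iter):
        U = fcm_memberships(X, centers, m)
        keep = U.sum(axis=0) > min_cardinality    # fuzzy cardinalities
        if 0 < keep.sum() < len(centers):         # discard spurious clusters
            centers, U = centers[keep], U[:, keep]
            U = U / U.sum(axis=1, keepdims=True)
        W = U ** m                                # FCM-style prototype update
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U
```

Starting from c_max clusters, spurious clusters lose cardinality over the iterations and are discarded, so the number of surviving clusters is the algorithm's estimate of the optimal number of clusters.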
The CA for relational data (CARD) clustering algorithm [16] is an extended version of the fuzzy CA [15] that can deal effectively with complex, non-Euclidean relational data. The CA and CARD [16] algorithms can learn the optimal number of clusters in an unsupervised manner. However, unsupervised clustering is an optimization problem that is prone to many local minima [17]. A possible way to overcome this drawback is to include pairwise constraints along with the unlabeled data, steering the learning task towards the optimal categorization and away from local minima.
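In the usual notation of the constrained-clustering literature (assumed here, since the paper's own symbols did not survive extraction), these pairwise constraints form two sets:

$$\mathrm{ML} = \{(x_j, x_k) \mid x_j \text{ and } x_k \text{ must lie in the same cluster}\}, \qquad \mathrm{CL} = \{(x_j, x_k) \mid x_j \text{ and } x_k \text{ must lie in different clusters}\}.$$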
The semi-supervised fuzzy clustering algorithm with pairwise constraints (SCAPC) [18] was introduced to improve typical semi-supervised fuzzy clustering with pairwise constraints. A novel penalty term was incorporated into the CA [15] objective function to accurately discover the optimal partition of the data by moderating the mismatch in order of magnitude between the penalty term and the original objective function. Although the side information in the SCAPC approach is used to bias the clustering algorithm towards an optimal partitioning, SCAPC does not learn and adapt the underlying distance measure. Moreover, clustering performance is significantly sensitive to the choice of the distance metric [19].
To overcome the above limitations, we introduce a novel semi-supervised relational fuzzy clustering algorithm with measure learning based on CA (SSRF-CA). The proposed algorithm extends the SURF-LDML algorithm introduced in [13]. More precisely, we exploit the CA [15] approach to learn the number of clusters while categorizing the data. The proposed SSRF-CA uses side information to penalize or reward the objective function and thus learn a better partition. Moreover, it learns a cluster-dependent dissimilarity measure.
Given the dataset
subject to
In
The distance
where
SSRF-CA depends on optimizing a combined objective function with five terms (as illustrated in
Analogously, the third term is the reward for satisfying a CL constraint. It is formulated in such a way that the reward for distant CL points is higher than for nearby ones. This reward term is also weighted by the fuzzy memberships. If the distance between two CL points is large, then the learned distance measure for this cluster is suitable, and the reward increases accordingly.
The fourth term of the objective function in
The last term in
The weights
The proposed approach can handle data matrices containing either pairwise distances or pairwise dissimilarities. The method that we apply to achieve this is inspired by Hathaway et al. [20], who introduced the relational version of the popular fuzzy C-means (FCM) algorithm [9]. It was proved in [20] that the squared Euclidean distance from a data point $x_k$ to the implicit center of cluster $i$ can be computed directly from the relational matrix $R$ of pairwise squared distances as

$$d_{ik}^{2} = (R v_i)_k - \frac{1}{2}\, v_i^{T} R\, v_i,$$

where $v_i$ denotes the normalized membership vector of cluster $i$, whose $j$-th component is $u_{ij}^{m} / \sum_{l=1}^{N} u_{il}^{m}$. In this formulation, no cluster prototypes are required: all quantities are expressed through pairwise relations and fuzzy memberships.
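A vectorized sketch of this identity follows (our own illustrative helper, not the paper's code; it assumes R holds pairwise squared Euclidean distances and U is the n-by-C fuzzy membership matrix):

```python
import numpy as np

def relational_distances(R, U, m=2.0):
    """Distances d2[k, i] from each point k to the implicit center of
    each cluster i, computed only from the relational matrix R and the
    fuzzy membership matrix U."""
    V = U ** m
    V = V / V.sum(axis=0, keepdims=True)      # one normalized vector v_i per column
    quad = np.einsum('ni,nm,mi->i', V, R, V)  # v_i^T R v_i for every cluster
    return R @ V - 0.5 * quad[None, :]
```

Note that for non-Euclidean dissimilarity matrices additional care (e.g., a spread transformation, as in non-Euclidean relational FCM variants) is typically needed to keep the computed distances non-negative.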
The aim of the proposed SSRF-CA algorithm is to learn the fuzzy membership values
where
and
The first term in
and
Furthermore, to optimize the objective functions in
where
SSRF-CA is initialized with a large number of clusters
In
It is important to note that when a data point
To put it simply, if a data point
The value of
In
where
The SSRF-CA algorithm is summarized as Algorithm 1.
The computational complexity of the proposed SSRF-CA is
In the following, we first describe the considered datasets and experimental settings. Then, we introduce the performance metrics used to assess and analyze the obtained results. Finally, we demonstrate the effectiveness of the proposed SSRF-CA by comparing its performance with the following relevant clustering algorithms:
The CA [15]: The prototype-based clustering approach, outlined in Section 2.2, can learn the optimal number of clusters in an unsupervised way. The pairwise constraints, ML and CL, are ignored in this algorithm.
The SCAPC [18]: A semi-supervised version of the CA algorithm, depicted in Section 2.2, uses the pairwise constraints, ML and CL, to provide partial supervision.
We compare the proposed SSRF-CA with relevant clustering approaches using different synthetic and real-world benchmark datasets provided by the UCI machine-learning repository.
As the proposed algorithm is designed as a semi-supervised clustering technique, we randomly select 10% of the data points and retain their labels (these are called seed points) to guide the learning process. In particular, the pairs of seed points residing in the same cluster compose the ML subset. Likewise, the pairs of seed points belonging to different clusters compose the CL subset. Tables 1 and 2 summarize the synthetic and real datasets, respectively.
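As an illustration of this seeding protocol, the following sketch (hypothetical helper names; not the paper's code) samples 10% of the points with their labels and derives the ML and CL pair sets from them:

```python
import numpy as np
from itertools import combinations

def build_constraints(labels, frac=0.10, seed=0):
    """Sample a fraction of labeled points (the 'seeds') and derive
    must-link (same label) and cannot-link (different label) pairs."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n_seeds = max(2, int(frac * len(labels)))
    idx = rng.choice(len(labels), size=n_seeds, replace=False)
    ml, cl = [], []
    for a, b in combinations(sorted(idx), 2):
        (ml if labels[a] == labels[b] else cl).append((a, b))
    return ml, cl
```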
The performance of the clustering algorithm is significantly sensitive to the choice of the parameter
For the proposed SSRF-CA, the two regularization parameters,
It is important to note that because of the bias term, the membership values,
To evaluate the performance of the proposed SSRF-CA and compare it with the considered relevant clustering algorithms, we assume that the true labels are given. Then, we calculate the clustering accuracy, Rand index, Jaccard coefficient, Fowlkes-Mallows index [23], and the learned number of clusters to reflect the overall performance of each algorithm.
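For reference, the pair-counting definitions of these indices can be written down directly; the sketch below (our own helper, not the paper's evaluation code) counts agreements and disagreements between the true and predicted partitions over all point pairs (quadratic in n, which is acceptable at these dataset sizes):

```python
import numpy as np

def pair_counts(true, pred):
    """Count pairs that are together in both partitions (a), apart in
    both (b), together only in the true one (c), or only in the
    predicted one (d)."""
    true, pred = np.asarray(true), np.asarray(pred)
    a = b = c = d = 0
    for i in range(len(true)):
        for j in range(i + 1, len(true)):
            same_t, same_p = true[i] == true[j], pred[i] == pred[j]
            if same_t and same_p:       a += 1
            elif same_t and not same_p: c += 1
            elif same_p:                d += 1
            else:                       b += 1
    return a, b, c, d

def rand_index(true, pred):
    a, b, c, d = pair_counts(true, pred)
    return (a + b) / (a + b + c + d)

def jaccard_coefficient(true, pred):
    a, _, c, d = pair_counts(true, pred)
    return a / (a + c + d)

def fowlkes_mallows(true, pred):
    a, _, c, d = pair_counts(true, pred)
    return a / np.sqrt((a + c) * (a + d))
```

Clustering accuracy additionally requires matching predicted clusters to true classes (e.g., via the Hungarian algorithm), which is omitted here.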
In this experiment, we illustrate how the new mapping enhances the clustering performance even when we include the new bias term in the objective function to learn the optimal number of clusters. We chose dataset 2 from the synthetic data to illustrate the learned distance measure because it includes clusters with diverse numbers of instances, shapes, and orientations.
In Figure 1, the small circles are the data points, and each cluster is illustrated using a different color. The black color represents the 10% seed points used for supervision. For the visualization and interpretation of the learned measures, we used the fact that learning the underlying distance measure is equivalent to finding a rescaling that replaces each input data point assigned to a cluster with its image under the square root of the learned measure matrix.
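Assuming the learned measure of a cluster is a positive semidefinite matrix $A$ (our reading of the standard metric-learning setup; the paper's exact notation did not survive extraction), the rescaling can be implemented with an eigendecomposition:

```python
import numpy as np

def rescale(X, A):
    """Map each row x of X to A^{1/2} x, so that plain Euclidean distances
    in the rescaled space equal sqrt((x - y)^T A (x - y)) in the original."""
    w, Q = np.linalg.eigh(A)                  # A is symmetric PSD
    A_half = Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T
    return X @ A_half.T                       # A_half is symmetric, so .T is cosmetic
```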
In Figure 1(a), the data is plotted in the original feature space, which means that the
This experiment focuses on how the proposed SSRF-CA is able to learn the optimal number of clusters that reflects the correct partition of the data. Specifically, we compare the results obtained using our proposed approach with those obtained using the state-of-the-art clustering approaches for the same synthetic and real datasets introduced in Section 4.1.
Figures 2
The geometric characteristics of dataset 2 make it somewhat difficult to categorize and to learn the optimal number of clusters for (as indicated in Figure 3(a)). The CA algorithm [15] was not able to partition this data and converged to a local minimum: it merged the entire dataset into one cluster, as illustrated in Figure 3(b), because it cannot identify the circular-shaped clusters. Furthermore, integrating partial supervision in the SCAPC algorithm [18] did not provide significant help. The proposed SSRF-CA, on the other hand, outperforms all the other algorithms in partitioning the data and learning the optimal number of clusters despite the complexity of the geometry of this data (see Figure 3(d)). The proposed SSRF-CA uses the pairwise constraint information effectively to learn a cluster-dependent distance measure along with the optimal number of clusters.
Defining the optimal number of clusters for dataset 3 is challenging, as it contains four clusters with different degrees of overlap and various sizes and shapes that are adjacent to each other (as indicated in Figure 4(a)). We observe from Figure 4 that neither CA nor SCAPC learned the number of clusters correctly, which is reflected in their categorization of the data. Figure 4 also illustrates that the SCAPC result is close to that of CA because SCAPC has no distinct mechanism for semi-supervised learning: it uses a single global parameter for both the ML and CL terms, which makes it less effective in handling this data geometry. The proposed algorithm misclassified some data points at the cluster boundaries; however, it learned the exact number of clusters and classified most of this dataset successfully. In fact, despite the complexity of the structure of this dataset, only the proposed SSRF-CA achieved a relatively reasonable partition.
The fourth synthetic dataset contains five Gaussian-distributed clusters with similar shapes, similar densities, and balanced sizes, which our proposed approach categorizes correctly (see Figure 5(d)). The proposed algorithm easily learns the optimal number of clusters, whereas the other algorithms, CA and SCAPC, were not able to learn the exact number of clusters and thus could not separate the five clusters successfully. These two algorithms were prone to local minima because they rely on an a priori, fixed distance measure. In fact, the performance of clustering algorithms relies critically on the choice of the distance measure. The proposed SSRF-CA, in contrast, learns the measure under supervision, which helps it escape local minima and correctly reflect the target categorization.
We can conclude from Figures 2
This experiment demonstrates how our method improves the clustering performance. The clustering results for both the synthetic and real datasets are summarized in the following subsections. Figures 6 and 7 display the clustering results for the synthetic and real datasets, respectively. The statistical results prove the success of the proposed SSRF-CA regardless of the variability in cluster shape, compactness, and size. Not all the datasets can be partitioned well by the other considered clustering algorithms. Conversely, the proposed approach performs relatively well on all the synthetic and real datasets. We observed that the categorization performance improves when we incorporate the bias term, in spite of the variability in the shape and density of each cluster. Moreover, integrating the pairwise constraints in the proposed SSRF-CA enables more effective grouping of the data and helps avoid local minima. However, the pairwise constraints might include noisy constraints that negatively affect the structure of the clusters and thus mislead the categorization of the data.
In this paper, we propose a novel fuzzy relational semi-supervised clustering algorithm based on an adaptive local distance measure, i.e., SSRF-CA. The proposed clustering algorithm exploits side-information in the form of constraints to steer the learning task. Along with its ability to perform data clustering, our algorithm is designed to handle relational data. Moreover, it learns a cluster-dependent dissimilarity measure between the data instances and also learns the optimal number of clusters automatically.
In our experiments, we demonstrated the ability of our proposed algorithm to learn the local distance measures and the optimal number of clusters while forming compact clusters. We used various synthetic and real-world benchmark datasets that contain different numbers of clusters with diverse shapes. Based on the experimental results, we concluded that the proposed SSRF-CA outperforms the other state-of-the-art algorithms. This performance can be attributed to the effective use of the pairwise constraints in the proposed SSRF-CA to guide the algorithm towards the optimal partition and to learn the underlying cluster distance measure.
Currently, we tune the parameters
No potential conflict of interest relevant to this article was reported.
Representation of the learned distance measure for dataset 2. (a) Data is plotted in the original feature space. (b) Rescaled data is plotted using
Results of clustering dataset 1 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 2 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 3 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 4 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Performance measures obtained on categorizing synthetic datasets: (a) dataset 1, (b) dataset 2, (c) dataset 3, and (d) dataset 4.
Performance measures obtained on categorizing real datasets: (a) bankruptcy, (b) seeds, (c) iris, and (d) Wi-Fi localization.
Table 1. Four synthetic datasets used in our experiments.

| Dataset | # of data points | # of clusters | Cluster sizes | # seeds per cluster | Balanced/unbalanced |
|---|---|---|---|---|---|
| Dataset 1 | 87 | 2 | 43, 44 | 5, 5 | Unbalanced |
| Dataset 2 | 238 | 3 | 100, 56, 82 | 10, 6, 9 | Unbalanced |
| Dataset 3 | 185 | 4 | 51, 42, 53, 39 | 6, 5, 6, 4 | Unbalanced |
| Dataset 4 | 250 | 5 | 50, 50, 50, 50, 50 | 5, 5, 5, 5, 5 | Balanced |
Table 2. Four real datasets used in our experiments.

| Dataset | # of data points | # of data attributes | # of clusters | Cluster sizes | # seeds per cluster | Balanced/unbalanced |
|---|---|---|---|---|---|---|
| Bankruptcy | 250 | 7 | 2 | 143, 107 | 15, 11 | Unbalanced |
| Seeds | 140 | 7 | 3 | 70, 70 | 7, 7 | Balanced |
| Iris | 150 | 4 | 3 | 50, 50, 50 | 5, 5, 5 | Balanced |
| Wi-Fi localization | 470 | 7 | 4 | 101, 108, 116, 145 | 11, 11, 12, 15 | Unbalanced |
Algorithm 1. SSRF-CA algorithm.

1. Fix the maximum number of clusters
2. Initialize the fuzzy partition matrix
3. Set
4. Initialize the matrices
5. Initialize the threshold
6. Compute the initial cardinalities
7. Compute the dissimilarity
8. Compute the membership vectors
9. Compute the distance
10. Update
11. Update the fuzzy membership using Eq. (7)
12. Update the measure distance matrices
13. Compute the cardinalities of all clusters
14. If (