International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(2): 156-167
Published online June 25, 2020
https://doi.org/10.5391/IJFIS.2020.20.2.156
© The Korean Institute of Intelligent Systems
Norah Ibrahim Fantoukh, Mohamed Maher Ben Ismail, and Ouiem Bchir
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Correspondence to: Mohamed Maher Ben Ismail and Ouiem Bchir (maher.benismail@gmail.com, ouiem.bchir@gmail.com)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Semi-supervised clustering relies on both labeled and unlabeled data to steer the clustering process toward an optimal categorization and to escape local minima. In this paper, we propose a novel fuzzy relational semi-supervised clustering algorithm based on an adaptive local distance measure (SSRF-CA). The proposed algorithm exploits side information, formulated as a set of constraints, to supervise the learning task. These constraints are expressed as reward and penalty terms that are integrated into a novel objective function. In particular, we cast the clustering task as an optimization problem through the minimization of this objective function. Solving the optimization problem yields the optimal values of the objective function parameters and defines the proposed semi-supervised clustering algorithm. Along with partitioning the data and learning the underlying dissimilarity measure between data instances, our algorithm determines the optimal number of clusters in an unsupervised manner. Moreover, SSRF-CA is designed to handle relational data, which makes it appropriate for applications where only pairwise similarity (or dissimilarity) information between data instances is available. We demonstrate the ability of the proposed algorithm to learn appropriate local distance measures and the optimal number of clusters while partitioning the data, using various synthetic and real-world benchmark datasets that contain varying numbers of clusters with diverse shapes. The experimental results show that the proposed SSRF-CA outperforms state-of-the-art algorithms and confirm the effectiveness of our clustering approach.
Keywords: Semi-supervised clustering, Relational data, Fuzzy clustering, Local distance measure learning, Optimal number of clusters
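As noted in the abstract, SSRF-CA is a relational algorithm: it operates on pairwise similarity or dissimilarity values rather than on feature vectors. As a minimal illustrative sketch (not part of the paper), the following Python snippet builds such a relational matrix from raw feature vectors using squared Euclidean distances; any other symmetric dissimilarity measure with a zero diagonal could be supplied to a relational clustering algorithm instead.

```python
import numpy as np

def relational_matrix(X):
    """Build a relational (pairwise dissimilarity) matrix from feature vectors.

    X: array of shape (n_samples, n_features).
    Returns R with R[i, j] = squared Euclidean distance between samples i and j.
    This is only one possible choice of dissimilarity, used here for illustration.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    R = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.maximum(R, 0.0)  # clip tiny negative values caused by round-off
```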
No potential conflict of interest relevant to this article was reported.
E-mail: n.fantoukh@gmail.com
E-mail: maher.benismail@gmail.com
E-mail: ouiem.bchir@gmail.com
Representation of the learned distance measure for dataset 2: (a) data plotted in the original feature space; (b) rescaled data plotted using the learned distance measure.
Results of clustering dataset 1 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 2 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 3 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Results of clustering dataset 4 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.
Performance measures obtained on categorizing synthetic datasets: (a) dataset 1, (b) dataset 2, (c) dataset 3, and (d) dataset 4.
Performance measures obtained on categorizing real datasets: (a) bankruptcy, (b) seeds, (c) iris, and (d) Wi-Fi localization.
Table 1. Four synthetic datasets used in our experiments.

| | # of data points | # of clusters | Cluster sizes | # seeds per cluster | Balanced/unbalanced |
|---|---|---|---|---|---|
| Dataset 1 | 87 | 2 | 43, 44 | 5, 5 | Unbalanced |
| Dataset 2 | 238 | 3 | 100, 56, 82 | 10, 6, 9 | Unbalanced |
| Dataset 3 | 185 | 4 | 51, 42, 53, 39 | 6, 5, 6, 4 | Unbalanced |
| Dataset 4 | 250 | 5 | 50, 50, 50, 50, 50 | 5, 5, 5, 5, 5 | Balanced |
Table 2. Four real datasets used in our experiments.

| | # of data points | # of data attributes | # of clusters | Cluster sizes | # seeds per cluster | Balanced/unbalanced |
|---|---|---|---|---|---|---|
| Bankruptcy | 250 | 7 | 2 | 142, 107 | 15, 11 | Unbalanced |
| Seeds | 140 | 7 | 3 | 70, 70 | 7, 7 | Balanced |
| Iris | 150 | 4 | 3 | 50, 50, 50 | 5, 5, 5 | Balanced |
| Wi-Fi localization | 470 | 7 | 4 | 101, 108, 116, 145 | 11, 11, 12, 15 | Unbalanced |
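Both tables report the number of labeled "seed" points per cluster that serve as side information. One common way to turn such seeds into the pairwise constraints used by semi-supervised clustering, shown below as an illustrative sketch rather than the paper's exact protocol, is to form must-link pairs between seeds that share a label and cannot-link pairs between seeds with different labels.

```python
from itertools import combinations

def constraints_from_seeds(seed_labels):
    """Derive pairwise constraints from labeled seed points.

    seed_labels: dict mapping a data-point index to its known cluster label.
    Returns (must_link, cannot_link) as lists of index pairs.
    Generic construction for illustration; not necessarily the paper's protocol.
    """
    must_link, cannot_link = [], []
    for (i, li), (j, lj) in combinations(seed_labels.items(), 2):
        (must_link if li == lj else cannot_link).append((i, j))
    return must_link, cannot_link

# Example: 5 seeds per cluster for a 2-cluster dataset (cf. dataset 1 in Table 1).
# The indices below are arbitrary and chosen purely for illustration.
seeds = {0: 'A', 3: 'A', 7: 'A', 12: 'A', 20: 'A',
         45: 'B', 50: 'B', 61: 'B', 70: 'B', 80: 'B'}
ml, cl = constraints_from_seeds(seeds)
```

For dataset 1 in Table 1, for instance, 5 seeds in each of the 2 clusters yield 2 × C(5, 2) = 20 must-link pairs and 5 × 5 = 25 cannot-link pairs.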
Algorithm 1. SSRF-CA algorithm

Fix the maximum number of clusters;
Initialize the fuzzy partition matrix;
Set …;
Initialize the matrices …;
Initialize the threshold;
Compute the initial cardinalities;
Compute the dissimilarity …;
Compute the membership vectors;
Compute the distance …;
Update …;
Update the fuzzy membership using Eq. (7);
Update the distance measure matrices;
Compute the cardinalities of all clusters;
If (…) …
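To make the flow of Algorithm 1 concrete, the following is a minimal structural sketch in Python. It uses the standard relational fuzzy c-means distance computation and a competitive-agglomeration-style elimination of low-cardinality clusters; the constraint reward/penalty terms and the distance-measure learning step of SSRF-CA are deliberately omitted, so this illustrates the loop structure rather than the authors' exact method.

```python
import numpy as np

def relational_fuzzy_distances(R, U, m=2.0):
    """Squared 'distances' d2[c, j] between cluster c and object j computed
    directly from a symmetric relational dissimilarity matrix R, as in
    relational fuzzy c-means: d2 = (R v_c)_j - 0.5 * v_c^T R v_c,
    with v_c the normalized m-th power of the memberships of cluster c."""
    V = U ** m
    V = V / V.sum(axis=1, keepdims=True)             # shape (C, N)
    RV = V @ R                                       # (R v_c)_j, shape (C, N)
    quad = np.einsum('cn,cn->c', RV, V)              # v_c^T R v_c, shape (C,)
    return np.maximum(RV - 0.5 * quad[:, None], 1e-12)

def fcm_memberships(d2):
    """Standard fuzzy c-means membership update for fuzzifier m = 2."""
    inv = 1.0 / d2
    return inv / inv.sum(axis=0, keepdims=True)

def ssrf_ca_skeleton(R, max_clusters=10, n_iter=100, card_threshold=3.0, seed=0):
    """Structural sketch only: start from an over-specified number of clusters,
    alternate distance/membership updates, and discard clusters whose fuzzy
    cardinality drops below a threshold. The supervision (reward/penalty)
    terms and Eq. (7) of the paper are not reproduced here."""
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    U = rng.random((max_clusters, N))
    U /= U.sum(axis=0, keepdims=True)                # initial fuzzy partition

    for _ in range(n_iter):
        d2 = relational_fuzzy_distances(R, U)        # relational distances
        U = fcm_memberships(d2)                      # membership update (simplified)
        card = U.sum(axis=1)                         # fuzzy cardinalities
        keep = card >= card_threshold                # drop spurious clusters
        if keep.any() and keep.sum() < U.shape[0]:
            U = U[keep]
            U /= U.sum(axis=0, keepdims=True)
    return U                                         # rows = surviving clusters
```

In the full SSRF-CA, the membership update of Eq. (7) additionally incorporates the constraint reward and penalty terms, and the cluster-dependent distance measure matrices are updated alongside the partition, as listed in Algorithm 1.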