Article Search
닫기

Original Article

Split Viewer

International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(2): 156-167

Published online June 25, 2020

https://doi.org/10.5391/IJFIS.2020.20.2.156

© The Korean Institute of Intelligent Systems

Automatic Determination of the Number of Clusters for Semi-Supervised Relational Fuzzy Clustering

Norah Ibrahim Fantoukh , Mohamed Maher Ben Ismail , and Ouiem Bchir

Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

Correspondence to :
Mohamed Maher Ben Ismail and Ouiem Bchir (maher.benismail@gmail.com, ouiem.bchir@gmail.com)

Received: February 16, 2020; Revised: May 10, 2020; Accepted: May 26, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Semi-supervised clustering relies on both labeled and unlabeled data to steer the clustering process towards optimal categorization and escape from local minima. In this paper, we propose a novel fuzzy relational semi-supervised clustering algorithm based on an adaptive local distance measure (SSRF-CA). The proposed clustering algorithm utilizes side-information and formulates it as a set of constraints to supervise the learning task. These constraints are expressed using reward and penalty terms, which are integrated into a novel objective function. In particular, we formulate the clustering task as an optimization problem through the minimization of the proposed objective function. Solving this optimization problem provides the optimal values of different objective function parameters and yields the proposed semi-supervised clustering algorithm. Along with its ability to perform data clustering and learn the underlying dissimilarity measure between the data instances, our algorithm determines the optimal number of clusters in an unsupervised manner. Moreover, the proposed SSRF-CA is designed to handle relational data. This makes it appropriate for applications where only pairwise similarity (or dissimilarity) information between data instances is available. In this paper, we proved the ability of the proposed algorithm to learn the appropriate local distance measures and the optimal number of clusters while partitioning the data using various synthetic and real-world benchmark datasets that contain varying numbers of clusters with diverse shapes. The experimental results revealed that the proposed SSRF-CA accomplished the best performance among other state-of-the-art algorithms and confirmed the outperformance of our clustering approach.

Keywords: Semi-supervised clustering, Relational data, Fuzzy clustering, Local distance measure learning, Optimal number of clusters

The authors are grateful for the support they received from the Research Center of the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

No potential conflict of interest relevant to this article was reported.

Norah Ibrahim Fantoukh is a teaching assistant in the Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. She received her master’s degree in Computer Science from King Saud University in 2020. Her research interests include pattern recognition, machine learning, and data mining.

E-mail: n.fantoukh@gmail.com

*The photo is not included according to the author’s request.

Mohamed Maher Ben Ismail is an associate professor in the Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. He received his Ph.D. in Computer Science from the University of Louisville in 2011. His research interests include pattern recognition, machine learning, data mining, and image processing.

E-mail: maher.benismail@gmail.com

Ouiem Bchir is an associate professor in the Computer Science Department, College of Computer and Information Systems, King Saud University, Riyadh, Saudi Arabia. She got her Ph.D. from the University of Louisville, KY, USA. Her research interests are pattern recognition, machine learning, and hyperspectral image analysis. She received the University of Louisville Dean’s Citation, the University of Louisville CSE Doctoral Award, and the Tunisian presidential award for her electrical engineering diploma.

E-mail ouiem.bchir@gmail.com

Article

Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2020; 20(2): 156-167

Published online June 25, 2020 https://doi.org/10.5391/IJFIS.2020.20.2.156

Copyright © The Korean Institute of Intelligent Systems.

Automatic Determination of the Number of Clusters for Semi-Supervised Relational Fuzzy Clustering

Norah Ibrahim Fantoukh , Mohamed Maher Ben Ismail , and Ouiem Bchir

Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia

Correspondence to:Mohamed Maher Ben Ismail and Ouiem Bchir (maher.benismail@gmail.com, ouiem.bchir@gmail.com)

Received: February 16, 2020; Revised: May 10, 2020; Accepted: May 26, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Semi-supervised clustering relies on both labeled and unlabeled data to steer the clustering process towards optimal categorization and escape from local minima. In this paper, we propose a novel fuzzy relational semi-supervised clustering algorithm based on an adaptive local distance measure (SSRF-CA). The proposed clustering algorithm utilizes side-information and formulates it as a set of constraints to supervise the learning task. These constraints are expressed using reward and penalty terms, which are integrated into a novel objective function. In particular, we formulate the clustering task as an optimization problem through the minimization of the proposed objective function. Solving this optimization problem provides the optimal values of different objective function parameters and yields the proposed semi-supervised clustering algorithm. Along with its ability to perform data clustering and learn the underlying dissimilarity measure between the data instances, our algorithm determines the optimal number of clusters in an unsupervised manner. Moreover, the proposed SSRF-CA is designed to handle relational data. This makes it appropriate for applications where only pairwise similarity (or dissimilarity) information between data instances is available. In this paper, we proved the ability of the proposed algorithm to learn the appropriate local distance measures and the optimal number of clusters while partitioning the data using various synthetic and real-world benchmark datasets that contain varying numbers of clusters with diverse shapes. The experimental results revealed that the proposed SSRF-CA accomplished the best performance among other state-of-the-art algorithms and confirmed the outperformance of our clustering approach.

Keywords: Semi-supervised clustering, Relational data, Fuzzy clustering, Local distance measure learning, Optimal number of clusters

Fig 1.

Figure 1.

Representation of the learned distance measure for dataset 2. (a) Data is plotted on the original feature space. (b) rescaled data is plotted using xxAi12.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 2.

Figure 2.

Results of clustering dataset 1 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 3.

Figure 3.

Results of clustering dataset 2 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 4.

Figure 4.

Results of clustering dataset 3 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 5.

Figure 5.

Results of clustering dataset 4 using three different algorithms: (a) true labels, (b) CA, (c) SCAPC, and (d) the proposed approach.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 6.

Figure 6.

Performance measures obtained on categorizing synthetic datasets: (a) dataset 1, (b) dataset 2, (c) dataset 3, and (d) dataset 4.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Fig 7.

Figure 7.

Performance measures obtained on categorizing real datasets: (a) bankruptcy, (b) seeds, (c) iris, and (d)Wi-Fi localization.

The International Journal of Fuzzy Logic and Intelligent Systems 2020; 20: 156-167https://doi.org/10.5391/IJFIS.2020.20.2.156

Table 1 . Four synthetic datasets used in our experiments.

# of data points# of clustersCluster sizes# seeds per clusterBalanced/unbalanced
Dataset 187243,445,5Unbalanced
Dataset 22383100.56,8210,6,9Unbalanced
Dataset 3185451,42,53,396,5,6,4Unbalanced
Dataset 4250550,50,50,50,505,5,5,5,5Balanced

Table 2 . Four real datasets used in our experiments.

# of data points# of data attributes# of clustersCluster sizes# seeds per clusterBalanced/unbalanced
Bankruptcy25072142, 10715, 11Unbalanced
Seeds1407370,707,7Balanced
Iris1504350,50,505,5,5Balanced
Wi-Fi localization47074101, 108, 116, 14511, 11, 12, 15Unbalanced

Algorithm 1. SSRF-CA algorithm..

Fix the maximum number of clusters C = Cmax and m ∈ [1, ∞);
Initialize the fuzzy partition matrix U(CL) randomly;
Set α1, α2, and α3;
Initialize the matrices Ai to the identity matrix;
Initialize the threshold ε1;
Compute the initial cardinalities Ni for 1 ≤ iC by using Eq. (10);
REPEAT
Compute the dissimilarity RAi for all clusters using Eq. (6);
Compute the membership vectors vi using Eq. (5);
For each point xk, k ∈ 1, 2, …, N
Compute the distance distik2 using Eq. (4);
Update α3(k) using Eq. (14) and (15);
Update the fuzzy membership using Eq. (7);
Update the measure distance matrices Ai using Eq. (12);
Compute the cardinalities of all clusters Ni by using equation Eq. (10);
If (Ni < ε1) remove cluster i and update the number of clusters C;
UNTIL (The max number of iterations is reached or the fuzzy memberships stabilize)

Share this article on :

Related articles in IJFIS