Original Article

International Journal of Fuzzy Logic and Intelligent Systems 2022; 22(3): 267-275

Published online September 25, 2022

https://doi.org/10.5391/IJFIS.2022.22.3.267

© The Korean Institute of Intelligent Systems

An Evolving Fuzzy Model to Determine an Optimal Number of Data Stream Clusters

Hussein A. A. Al-Khamees, Nabeel Al-A’araji, and Eman S. Al-Shamery

Department of Software, Babylon University, Babylon, Iraq

Correspondence to: Hussein A. A. Al-Khamees (hussein.alkhamees7@gmail.com)

Received: May 15, 2022; Accepted: September 6, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Data streams are a modern type of data that differ from traditional data in various characteristics: their indefinite size, high access, and concept drift due to their origin in non-stationary environments. Data stream clustering aims to split these data samples into significant clusters, depending on their similarity. The main drawback of data stream clustering algorithms is the large number of clusters they produce. Therefore, determining an optimal number of clusters is an important challenge for these algorithms. In practice, evolving models can change their general structure by implementing different mechanisms. This paper presents a fuzzy model that mainly consists of an evolving Cauchy clustering algorithm which is updated through a specific membership function and determines the optimal number of clusters by implementing two evolving mechanisms: adding and splitting clusters. The proposed model was tested on six different streaming datasets, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results demonstrated that the efficiency of the proposed model in producing an optimal number of clusters for each dataset outperforms that of previous models.

Keywords: Data stream clustering, Clusters number, Evolving mechanisms

Artificial intelligence (AI) is a broad branch of computer science, and machine learning is the backbone of various techniques, among which clustering is one of the most important [1]. It aims to split a dataset into significant clusters of data samples that have a high degree of similarity between them, and a low degree of similarity with the data samples in other clusters [2].

Data stream analysis has become one of the most active and effective research fields in computer science due to the diverse challenges that it poses compared with traditional data analysis, including the indefinite size, high access, single scan, limited memory and processing time, and concept drift caused by the non-stationary condition of the environments where data originate [3]. Data stream mining links two essential computer fields: data mining tasks and data streams [4].

Many techniques for analyzing traditional data can also be implemented on data streams, with clustering being one of them. Data stream clustering methods can be categorized into five types, one of which is the density-based class. This category includes the evolving Cauchy (e-Cauchy) clustering algorithm, on which this study is based [5].

Evolving systems are naturally able to change the general structure of the model designed to describe the data stream by updating it after the arrival of each data sample. This is achieved through several mechanisms, such as adding, merging, and splitting clusters to reduce the large number of clusters generated [6].

Fuzzy systems, which are based on the concept of fuzzy logic, are used extensively in different fields. In addition, evolving fuzzy algorithms are considered an important type of evolving system because of their ability to interact with the data provided [7].

Determining an optimal number of generated clusters remains an open challenge, and traditional solutions cannot be applied to all cases. Therefore, evolving mechanisms are proposed as a solution to set an appropriate number of data stream clusters [8].

Most clustering algorithms (including e-Cauchy) generate a large number of clusters. Thus, the research question addressed in this study is how to overcome the large number of clusters generated while applying clustering algorithms to data streams.

To overcome this problem, this study proposes an evolving model based on the e-Cauchy algorithm. The model adopts two evolving mechanisms: adding clusters, and splitting the generated clusters into high-quality and low-quality clusters, after which all data samples are re-assigned from the low-quality to the high-quality clusters. Moreover, the paper presents a new fuzzy method for distributing testing data.

To evaluate the proposed model, several streaming datasets were used, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results demonstrated the model’s efficiency in producing an optimal number of clusters for each dataset and showed that it achieves a higher silhouette coefficient than previous models.

This section presents the most relevant studies related to the determination of an optimal number of generated clusters.

The authors in [9] discussed the limitations of the K-means algorithm, the first of which is the determination of the number of clusters. They presented a cluster number assisted K-means (CNAK) model that estimates this number based on bipartite matching and by adjusting the algorithm parameters. This model can generate different scores for the number of clusters (NoC), which ranges from 1 to NoCmax, by applying the Kuhn-Munkres algorithm to obtain bipartite matching. The authors selected a random sample, used it as a reference, mapped other centroid clusters and compared them.

The authors of [10] used one of the most effective validity indices, the entropy measure, as an indicator to determine the optimal number of clusters. Initialization is one of the most important steps in K-means algorithms; in this model it is achieved by selecting the first data sample through entropy, hence the model is known as an entropy-based initialization model. In addition, the model aims to maximize a Shannon entropy-based objective function. The authors reported that their model is better than the K-means algorithm in terms of the number of clusters.

A density-peak-based clustering algorithm for automatically determining the number (DPADN) of clusters was proposed in [8]. This model consists of three steps. First, a density measure is designed by applying a Gaussian kernel function to compute the density for all dataset samples. Second, a precluster approach is constructed to find the center of each cluster. Finally, a method is proposed which automatically sets the center of the clusters. The evolving mechanisms of this model include searching the nearest two clusters and then merging them into one.

Data stream mining can be defined as uncovering interesting and useful patterns from a large amount of data in a way that makes these patterns understandable. This can be achieved by many techniques [11].

In recent years, clustering has become one of the most important and widely used data stream mining techniques. Data stream learning is divided into two main types: supervised and unsupervised learning. Clustering is an unsupervised learning method [12].

Different methods are used to group (cluster) all the samples in a given dataset into clusters, each containing many samples that have a high degree of similarity between each other but are not similar to the samples in other clusters [11].

There are five types of data stream clustering methods: partitioning, hierarchical, grid-based, density-based, and model-based methods [13].

These same five types are also applied to traditional data. The model herein proposed is based on the e-Cauchy algorithm, which is a density-based method.

Cluster validation comprises three major tasks: clustering evaluation, clustering stability, and clustering tendency. Each task has several aspects. Clustering stability assesses how sensitive the clustering result is to different algorithm parameters, such as the number of clusters [14].

Cluster validation is also referred to as determining the number of clusters in a dataset. The methods applied to set the number of clusters can be classified into three groups: methods based on merging and splitting, traditional methods, and methods based on evolutionary computation. The first of these groups includes evolving methods that can be implemented on datasets to determine cluster numbers [14].

Traditional models usually fail when dealing with streaming data because of the challenges posed by this modern data type. Evolving systems represent an important and attractive solution for handling data streams [13]. One of the most important characteristics of evolving systems is their ability to change the general structure of the model designed to describe the data stream. Accordingly, the major task and key feature of any evolving system is associated with several mechanisms, such as adding, splitting, merging, and removing entities. When the evolving system is based on the clustering technique, the above mechanisms are implemented in terms of clusters, i.e., adding clusters, splitting clusters, etc. [5].

This ensures greater flexibility when the system situations change into non-stationary environments where the data evolve over time or when concept drift appears. In traditional models, the data distribution is assumed to be stationary and the structure of the models and their parameters do not change over time. In evolving systems, the data environment is non-stationary; therefore, both the structure and parameters of the models change over time in what is classified as a dynamic state [8].

In these cases, the model is more related to non-stationary environments that can generate a continuous stream of data whose distribution does not remain constant.

More specifically, some designers train their models offline (offline training), which is sometimes known as batch training. Such a model may initially perform well, but its performance will gradually deteriorate, especially as the upcoming data evolve. Moreover, evolving systems have the ability to forget data (especially after updating) in a proactive manner to maintain memory efficiency, so that processing the data stream causes no memory problems [15].

Certainly, the analytics of these data must be performed in real time; therefore, online analytics may be implemented through evolving identification methods that allow the simultaneous adjustment of the structure and parameters [16].

Furthermore, when the system is time-variant, it is necessary to describe the various behaviors through the evolution of the model structure and to identify the parameters online [17].

Naturally, because there is a change in the general behavior of the data stream, the learner should evolve to keep up with the data change. The training step is the general idea behind an evolving system. This step comprises the following [18]:

  • 1) Splitting the input space via a clustering algorithm to learn the antecedent parameters.

  • 2) Tuning the parameters and structure to apply the evolution mechanisms.

  • 3) Implementing a learning technique to learn the consequent parameters.

Several clustering algorithms for data streams have been developed in recent years. In this study, we employed the e-Cauchy algorithm. The main idea was to compute the density of each arriving data sample from the dataset to construct the initial clusters, which are typically numerous. Subsequently, the cluster splitting mechanism was applied. Figure 1 shows a block diagram of the proposed model comprising five main parts that are explained in detail in the following subsections.

3.1 Data Pre-processing

Normalization is an essential data pre-processing step for most types of problems. Normalization can be accomplished through many methods, including min-max, decimal scaling, z-score, median and MAD, double sigmoid, and tanh estimators [19]. In the proposed model, the min-max technique was implemented. Given a set of matching scores {Ms}, where s = 1, 2, ..., n, the normalized scores (Ms’) are computed as:

$$M_s' = \frac{M_s - \min}{\max - \min}.$$

The dataset is then partitioned into training (70% of the dataset) and test data (30%).
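The normalization and partitioning steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `min_max_normalize` and `split_stream` are hypothetical, the handling of a constant feature is an assumption, and splitting in arrival order (rather than randomly) is also an assumption, since the text only states the 70%/30% proportions.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Scale scores to [0, 1] using (M_s - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def split_stream(samples: Sequence, train_fraction: float = 0.7) -> Tuple[list, list]:
    """Split samples in arrival order into training and test parts (assumed order)."""
    cut = round(len(samples) * train_fraction)
    return list(samples[:cut]), list(samples[cut:])


scores = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(scores)

stream = list(range(10))
train, test = split_stream(stream)
```

For a multi-feature dataset, the same scaling would be applied to each feature column separately.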

3.2 Clustering Algorithm

The core step of the e-Cauchy clustering algorithm is the computation of the density of each training data sample. It is worth noting that this algorithm uses two predefined thresholds: θ (that is set to 0.1) and the density threshold (Denthr), which is unique for any given dataset. Specifically, the latter was considered the most effective threshold.

The algorithm proceeds as follows: When the first data sample arrives, the first cluster is constructed and its parameters are set. When an additional data sample arrives, the density value is computed and compared with Denthr; if the density value is less than Denthr, then a new cluster is constructed; otherwise, the current data sample is appended to an existing cluster and the parameters of the corresponding cluster are modified. These steps are repeated for all samples of the selected dataset to simultaneously build the initial clusters, thus performing the first cluster addition mechanism.

The pseudocode for the e-Cauchy algorithm is shown in Figure 2, where data stream D consists of {x1, x2, ..., xi}, SCj refers to the samples in cluster (j), CCj denotes the center of cluster (j), and NoC is the number of clusters.
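The cluster-adding loop described above can be sketched as follows. This is an illustrative approximation, not the algorithm in Figure 2: the text does not state the density formula, so the sketch assumes a simple Cauchy-type kernel 1 / (1 + d²/σ²) with a fixed, hypothetical `sigma2`; it assumes the density compared with Denthr is the highest density over the existing clusters; it updates centers with a running mean; and it omits the second threshold θ. The names `cauchy_density` and `add_clusters` are hypothetical.

```python
from typing import Dict, List, Sequence


def cauchy_density(x: Sequence[float], center: Sequence[float], sigma2: float = 1.0) -> float:
    """Assumed Cauchy-type density of sample x with respect to a cluster center."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return 1.0 / (1.0 + d2 / sigma2)


def add_clusters(stream: Sequence[Sequence[float]], den_thr: float) -> List[Dict]:
    """Build the initial clusters in one pass over the stream."""
    clusters: List[Dict] = []  # each cluster holds its center (CC_j) and samples (SC_j)
    for x in stream:
        if not clusters:
            # The first sample creates the first cluster.
            clusters.append({"center": list(x), "samples": [x]})
            continue
        densities = [cauchy_density(x, c["center"]) for c in clusters]
        best = max(range(len(clusters)), key=densities.__getitem__)
        if densities[best] < den_thr:
            # Density below the threshold: open a new cluster.
            clusters.append({"center": list(x), "samples": [x]})
        else:
            # Otherwise append to the densest cluster and update its center.
            c = clusters[best]
            c["samples"].append(x)
            n = len(c["samples"])
            c["center"] = [m + (xi - m) / n for m, xi in zip(c["center"], x)]
    return clusters


demo_stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
initial_clusters = add_clusters(demo_stream, den_thr=0.5)
noc = len(initial_clusters)  # number of clusters (NoC)
```

A lower `den_thr` makes the sketch more willing to absorb samples into existing clusters, which is consistent with the text's remark that the density threshold must be tuned per dataset.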

3.3 Splitting Mechanism

As mentioned above, the cluster splitting mechanism is applied to divide all the generated clusters (initial clusters from Algorithm 1 in Figure 2) into high- and low-quality clusters (HQC and LQC). By applying this mechanism, the model can evolve by re-assigning the samples in all LQCs to HQCs to produce the final HQCs.

However, both the cluster adding and splitting mechanisms occur while training the model (with training data), whereas the test data samples are assigned only to the HQCs that result from the cluster splitting mechanism. Figure 3 illustrates the pseudocode for the splitting algorithm.
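A rough sketch of the splitting and re-assignment idea is given below. The criterion that separates HQCs from LQCs is defined in Figure 3 and is not stated in the text, so this sketch substitutes a hypothetical size-based rule (`min_size`); the nearest-center re-assignment and the recomputation of centers are likewise assumptions. It should be read as an illustration of the mechanism, not as the authors' algorithm.

```python
from typing import Dict, List, Sequence


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split_and_reassign(clusters: List[Dict], min_size: int) -> List[Dict]:
    """Split clusters into HQCs and LQCs, then move LQC samples into HQCs."""
    # Hypothetical quality rule: a cluster is high quality if it has enough samples.
    hqc = [{"center": list(c["center"]), "samples": list(c["samples"])}
           for c in clusters if len(c["samples"]) >= min_size]
    lqc = [c for c in clusters if len(c["samples"]) < min_size]
    if not hqc:
        raise ValueError("no high-quality cluster available for re-assignment")
    for c in lqc:
        for x in c["samples"]:
            # Assumption: each LQC sample joins the HQC with the nearest center.
            nearest = min(hqc, key=lambda h: sq_dist(x, h["center"]))
            nearest["samples"].append(x)
    for h in hqc:
        # Recompute each center as the mean of its final samples.
        n = len(h["samples"])
        dim = len(h["center"])
        h["center"] = [sum(s[k] for s in h["samples"]) / n for k in range(dim)]
    return hqc


demo_clusters = [
    {"center": [0.1, 0.0], "samples": [(0.0, 0.0), (0.2, 0.0)]},
    {"center": [5.0, 5.1], "samples": [(5.0, 5.0), (5.0, 5.2)]},
    {"center": [1.0, 0.0], "samples": [(1.0, 0.0)]},
]
final_hqc = split_and_reassign(demo_clusters, min_size=2)
```

In this toy input the single-sample cluster is treated as low quality and its sample is absorbed by the nearest of the two larger clusters.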

3.4 Assignment of Test Data

In this model, the assignment of the test data samples requires three steps: computing the distance between the current test data sample and the first cluster, building a sub-cluster with a radius equal to the computed distance, and determining all testing data samples that fall within the built sub-cluster. Subsequently, the distances of all data samples in the sub-cluster and their average are calculated.

As a result of these steps, a set of distances is available for each test data sample. In fact, the number of distances should be equal to the number of HQCs formed by applying Algorithm 2 in Figure 3.

In some cases, the number of distances is large; thus, additional calculation time may be required to evaluate them all (as in the case of the sensor data stream, which has 54 clusters; see Table 1). Therefore, using a fuzzy membership function is very useful. In the proposed methodology, all computed distances were included in a membership function, as described in line 5 of the pseudocode for the test data assignment algorithm shown in Figure 4.

Finally, the minimum distance is selected (as described in line 6 of Algorithm 3 in Figure 4) to assign the current sample and all samples within the specified radius (as shown in line 7). This step is performed to ensure that the current data sample is assigned to the nearest cluster.

3.5 Evaluation

The silhouette coefficient (SC) is an internal index that evaluates the quality of the clustering results. More specifically, SC indicates whether the data samples were correctly clustered and separated (cluster coherence and goodness). The SC for a data sample x(i) is computed as [20]:

$$SC_{x(i)} = \frac{b_{x(i)} - a_{x(i)}}{\max\left(a_{x(i)},\, b_{x(i)}\right)},$$

where ax(i) is the average distance from x(i) to every data sample in the same cluster, and bx(i) is the lowest average distance between x(i) and every data sample in other clusters. It is worth pointing out that the SC for each stream dataset is measured before and after the evolving process for both the training and test data.
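The definition above can be illustrated with a small pure-Python sketch. The function names are hypothetical, Euclidean distance is an assumption (the text does not name the metric), and returning 0 for a singleton cluster is a common convention rather than something stated in the paper.

```python
import math
from typing import List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_sample(i: int, points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """SC of sample i: (b - a) / max(a, b)."""
    own = labels[i]
    same = [euclidean(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention for a cluster with a single sample
    a = sum(same) / len(same)  # mean distance within the own cluster
    other_means: List[float] = []
    for lab in set(labels):
        if lab == own:
            continue
        dists = [euclidean(points[i], p) for j, p in enumerate(points) if labels[j] == lab]
        other_means.append(sum(dists) / len(dists))
    if not other_means:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    b = min(other_means)  # lowest mean distance to another cluster
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def silhouette_mean(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average SC over all samples."""
    return sum(silhouette_sample(i, points, labels) for i in range(len(points))) / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sc = silhouette_mean(points, labels)
```

Values close to 1 indicate compact, well-separated clusters, while negative values (as reported for the training data before evolving) indicate that samples are on average closer to another cluster than to their own.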

Based on the confusion matrix, which provides good details about the classifier, we use accuracy and purity [21]. Accuracy is computed according to

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Purity can be calculated by

$$\mathrm{Purity} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Precision},$$

where the precision is given by

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
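The three formulas can be checked with a short Python sketch. The function names and the example counts are hypothetical, and the interpretation of n as the number of clusters (each contributing its own precision) is an assumption, since the text does not define n.

```python
from typing import Sequence, Tuple


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)."""
    return tp / (tp + fp)


def purity(per_cluster_counts: Sequence[Tuple[int, int]]) -> float:
    """Mean precision over n clusters, given one (TP, FP) pair per cluster (assumed reading of n)."""
    n = len(per_cluster_counts)
    return sum(precision(tp, fp) for tp, fp in per_cluster_counts) / n


# Made-up counts, used only to exercise the formulas.
acc = accuracy(tp=40, tn=50, fp=5, fn=5)
pur = purity([(8, 2), (9, 1)])
```

With these made-up counts, the accuracy is 90 correct decisions out of 100, and the purity is the mean of the two per-cluster precisions 0.8 and 0.9.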

The proposed model was implemented in Python 3.6.9 on a Windows 10 Professional operating system, with a 2.5 GHz Core i7 CPU and 16 GB of RAM.

To evaluate the proposed model, different streaming datasets were used, which were generated from diverse domains, such as questionnaires, user behavior recognition, and human activity recognition. Table 1 provides a brief overview of the datasets with a numerical data type [22,23].

As explained earlier, the e-Cauchy clustering algorithm uses two thresholds; the most important one is Denthr. However, the value of this threshold varies from one dataset to another. Therefore, a specific threshold value was set for each streaming dataset.

The first dataset used to test the proposed model was the power supply stream dataset, for which Denthr was set to 0.0161. Figure 5 shows the results for this dataset. The model initially generated 1,400 clusters; after implementing the evolving step, the number was reduced to 678, and the model finally reached the optimal number of 24 clusters. The SC for the training data before evolving was −0.75, changing to 0.30 after evolution, whereas the SC for the test data before and after evolving was 0.31 and 0.33, respectively.

Considering the SC value of the testing step for the power supply dataset, which is 0.33, the proposed model outperforms the models presented in [24] and [25], which achieved an SC of 0.18 and 0.32, respectively.

The second dataset used for testing the model was the sensor data stream. For this dataset, Denthr was set to 0.0011. The initial number of clusters was 1,887, which was reduced to 1,052 and then to 412; finally, the optimal number of clusters was 54. In terms of SC, for the training step it was −0.52 before evolving and 0.22 afterwards, whereas for the test data it was 0.08 and 0.49 before and after evolving, respectively. Figure 6 shows the results for the sensor data stream.

Based on the final SC value of the testing step for this dataset (0.49), the proposed model achieved a higher SC than the model proposed in [24], which achieved an SC value of 0.30.

The HuGaDB01_01 sensor dataset was then used to test the proposed model, and Denthr was set to 0.0038. The number of clusters was initially 852, then decreased to 221, to 9, and finally to 4. In terms of SC, its value for the training step was −0.68 before evolving and 0.39 afterwards. The SC for the testing step before evolving was −0.002, increasing to 0.52 after evolving. Figure 7 illustrates these results.

The fourth evaluation dataset was the UCI-HAR stream dataset, and its Denthr was set to 0.00028. The number of clusters was initially 3,605, which decreased to 933, then to 412, and finally to 6, which indicates the optimal NoC. The SC of the training data before evolution was −0.32, which increased to 0.41 after evolving, while the SC for the test data was 0.11 before evolution and 0.53 thereafter. Figure 8 shows the results for the UCI-HAR dataset.

According to the SC value of the UCI-HAR dataset, the proposed model outperforms the model presented in [26], which achieved an SC of 0.441 by implementing the K-means algorithm.

Next, the model was tested on the Luxembourg stream dataset, with a Denthr of 0.0026. The number of clusters at the beginning was 1,901, then 666, and finally 2, which was the optimal NoC. The SC of the training data before evolution was −0.44, which increased to 0.57 after evolving, whereas the SC for the test data was −0.08 before the evolution and 0.57 afterwards. Figure 9 shows the results for this dataset.

The last dataset was the keystrokes stream dataset, for which Denthr was set to 0.055. Initially, the number of clusters was 592; the NoC then decreased to 39, to 8, and finally to 4. The SC values of the training data before and after evolution were −0.36 and 0.29, respectively. The SC for the test data was 0.01 before the evolution and increased to 0.57 afterwards. These results are shown in Figure 10.

The accuracy and purity for each stream dataset are listed in Table 2. Based on these results, the proposed model compares favorably with several previous models. For the sensor data stream, this model achieved a purity of 81.25%, whereas the model in [27] had a purity of 76.5%. Similarly, the proposed model achieved an accuracy of 84.57% for the keystrokes dataset, whereas the model in [29] attained an accuracy of 77.0%. In the case of the UCI-HAR dataset, the accuracy of the proposed model was 77.39%, slightly below the 78.79% reported for the model in [28].

In general, the proposed model is highly accurate in assigning the data samples to an appropriate cluster, although it is not suitable for processing high-dimensional stream datasets.

Data streams are a modern type of data that can be generated in real-world applications. Their main characteristics are their massive size, high speed, and concept drift. Many techniques can be used on data streams, including clustering, which aims to group similar data samples into different clusters. Fuzzy systems are widely used in computer science, particularly in the field of AI. Determining an optimal number of clusters remains an open problem for researchers, as there is no static method for this purpose. This paper presents a fuzzy model to overcome such a problem. The proposed fuzzy model depends on the e-Cauchy clustering algorithm, which is a density-based data stream clustering method and implements a specific fuzzy membership function. The model applies two evolving mechanisms: adding and splitting clusters. The evaluation step involves finding an optimal number of clusters, as well as computing the SC, accuracy, and purity. Six streaming datasets were used to evaluate this model, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results obtained from the proposed model were analyzed and compared with those of previous models, showing that this model is more efficient and has a better performance than other existing models.

Our future work will focus on applying the proposed model to online cloud computing and synchronizing it with login services to determine the authenticity and validity of users.

Fig. 1. Block diagram of the proposed method.

Fig. 2. Pseudocode for the e-Cauchy algorithm.

Fig. 3. Pseudocode for the cluster splitting algorithm.

Fig. 4. Algorithm for assigning test samples.

Fig. 5. (a) NoC and (b) SC for the power supply dataset.

Fig. 6. (a) NoC and (b) SC for the sensor data stream.

Fig. 7. (a) NoC and (b) SC for the HuGaDB dataset.

Fig. 8. (a) NoC and (b) SC for the UCI-HAR dataset.

Fig. 9. (a) NoC and (b) SC for the Luxembourg dataset.

Fig. 10. (a) NoC and (b) SC for the keystrokes data stream.


Table 1. Characteristics of numerical streaming datasets.

#   Dataset        Classes   Features   Samples
1   Power supply   24        2          29,928
2   Sensor         54        5          2,219,803
3   HuGaDB01_01    4         39         2,435
4   UCI-HAR        6         561        10,299
5   Luxembourg     2         30         1,901
6   Keystrokes     4         10         1,600

Table 2. Quality indices for the tested stream datasets.

#   Dataset        Accuracy (%)   Purity (%)
1   Power supply   72.30          71.88
2   Sensor         81.48          81.25
3   HuGaDB01_01    89.75          89.12
4   UCI-HAR        77.39          77.00
5   Luxembourg     97.80          97.80
6   Keystrokes     84.57          84.93

  1. Cioffi, R, Travaglioni, M, Piscitelli, G, Petrillo, A, and De Felice, F (2020). Artificial intelligence and machine learning applications in smart production: progress, trends, and directions. Sustainability. 12. article no. 492
  2. Abdullatif, A, Masulli, F, and Rovetta, S (2018). Clustering of nonstationary data streams: a survey of fuzzy partitional methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8. article no. e1258
  3. Al-Khamees, HA, and Al-Shamery, ES (2021). Survey: clustering techniques of data stream. Proceedings of 2021 1st Babylon International Conference on Information Technology and Science (BICITS), Babil, Iraq, pp. 113-119. https://doi.org/10.1109/BICITS51482.2021.9509923
  4. Al-Tarawneh, M (2021). Data stream classification algorithms for workload orchestration in vehicular edge computing: a comparative evaluation. International Journal of Fuzzy Logic and Intelligent Systems. 21, 101-122. https://doi.org/10.5391/IJFIS.2021.21.2.101
  5. Skrjanc, I, Ozawa, S, Ban, T, and Dovzan, D (2018). Large-scale cyber attacks monitoring using evolving Cauchy possibilistic clustering. Applied Soft Computing. 62, 592-601. https://doi.org/10.1016/j.asoc.2017.11.008
  6. Al-Khamees, HAA, Al-A’araji, N, and Al-Shamery, ES (2021). Data stream clustering using fuzzy-based evolving Cauchy algorithm. International Journal of Intelligent Engineering and Systems. 14, 348-358. https://doi.org/10.22266/ijies2021.1031.31
  7. Lughofer, E (2011). Evolving Fuzzy Systems: Methodologies, Advanced Concepts and Applications. Berlin: Springer. https://doi.org/10.1007/978-3-642-18087-3
  8. Tong, W, Liu, S, and Gao, XZ (2021). A density-peak-based clustering algorithm of automatically determining the number of clusters. Neurocomputing. 458, 655-666. https://doi.org/10.1016/j.neucom.2020.03.125
  9. Saha, J, and Mukherjee, J (2021). CNAK: cluster number assisted K-means. Pattern Recognition. 110. article no. 107625
  10. Chowdhury, K, Chaudhuri, D, and Pal, AK (2021). An entropy-based initialization method of K-means clustering on the optimal number of clusters. Neural Computing and Applications. 33, 6965-6982. https://doi.org/10.1007/s00521-020-05471-9
  11. Al-Khamees, HAA, Al-Jwaid, WRH, and Al-Shamery, ES (2022). The impact of using convolutional neural networks in COVID-19 tasks: a survey. International Journal of Computing and Digital Systems. 11, 189-197. https://doi.org/10.12785/ijcds/110194
  12. Wicaksana, Wiharto AK, and Cahyani, DE (2021). Modification of a density-based spatial clustering algorithm for applications with noise for data reduction in intrusion detection systems. International Journal of Fuzzy Logic and Intelligent Systems. 21, 189-203. https://doi.org/10.5391/IJFIS.2021.21.2.189
  13. Mousavi, M, Bakar, AA, and Vakilian, M (2015). Data stream clustering algorithms: a review. Int J Adv Soft Comput Appl. 7, 1-15.
  14. Hancer, E, and Karaboga, D (2017). A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number. Swarm and Evolutionary Computation. 32, 49-67. https://doi.org/10.1016/j.swevo.2016.06.004
  15. Angelov, P, and Kordon, A (2010). Evolving inferential sensors in the chemical process industry. Evolving Intelligent Systems: Methodology and Applications. Hoboken, NJ: John Wiley & Sons, pp. 313-336. https://doi.org/10.1002/9780470569962.ch14
  16. Skrjanc, I, Andonovski, G, Ledezma, A, Sipele, O, Iglesias, JA, and Sanchis, A (2018). Evolving cloud-based system for the recognition of drivers’ actions. Expert Systems with Applications. 99, 231-238. https://doi.org/10.1016/j.eswa.2017.11.008
  17. Souza, V, dos Reis, DM, Maletzke, AG, and Batista, GE (2020). Challenges in benchmarking stream learning algorithms with real-world data. Data Mining and Knowledge Discovery. 34, 1805-1858. https://doi.org/10.1007/s10618-020-00698-5
  18. Baruah, RD, and Angelov, P (2011). Evolving fuzzy systems for data streams: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1, 461-476. https://doi.org/10.1002/widm.42
  19. Jain, A, Nandakumar, K, and Ross, A (2005). Score normalization in multimodal biometric systems. Pattern Recognition. 38, 2270-2285. https://doi.org/10.1016/j.patcog.2005.01.012
  20. Shamim, G, and Rihan, M (2020). Multi-domain feature extraction for improved clustering of smart meter data. Technology and Economics of Smart Grids and Sustainable Energy. 5. article no. 10
  21. Al-Khamees, HAA, Al-A’araji, N, and Al-Shamery, ES (2022). Classifying the human activities of sensor data using deep neural network. Intelligent Systems and Pattern Recognition. Cham, Switzerland: Springer, pp. 107-118. https://doi.org/10.1007/978-3-031-08277-1_9
  22. Chereshnev, R, and Kertesz-Farkas, A (2017). HuGaDB: human gait database for activity recognition from wearable inertial sensor networks. Analysis of Images, Social Networks and Texts. Cham, Switzerland: Springer, pp. 131-141. https://doi.org/10.1007/978-3-319-73013-4_12
  23. Anguita, D, Ghio, A, Oneto, L, Parra, X, and Reyes-Ortiz, JJ (2012). Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. Ambient Assisted Living and Home Care. Heidelberg, Germany: Springer, pp. 216-223. https://doi.org/10.1007/978-3-642-35395-6_30
  24. Supardi, NA, Abdulkadir, SJ, and Aziz, N (2020). An evolutionary stream clustering technique for outlier detection. Proceedings of 2020 International Conference on Computational Intelligence (ICCI), Bandar Seri Iskandar, Malaysia, pp. 299-304. https://doi.org/10.1109/ICCI51257.2020.9247832
  25. Tang, XF, Huang, R, Chen, Q, Peng, ZY, Wang, H, and Wang, BH (2019). An outlier detection method of low-voltage users based on weekly electricity consumption. IOP Conference Series: Materials Science and Engineering. 631. article no. 042004
  26. Ariza-Colpas, PP, Vicario, E, Oviedo-Carrascal, AI, Butt Aziz, S, Pineres-Melo, MA, Quintero-Linero, A, and Patara, F (2022). Human activity recognition data analysis: history, evolutions, and new trends. Sensors. 22. article no. 3401
  27. Shao, J, Tan, Y, Gao, L, Yang, Q, Plant, C, and Assent, I (2019). Synchronization-based clustering on evolving data stream. Information Sciences. 501, 573-587. https://doi.org/10.1016/j.ins.2018.09.035
  28. Abedin, A, Motlagh, F, Shi, Q, Rezatofighi, H, and Ranasinghe, D (2020). Towards deep clustering of human activities from wearables. Proceedings of the 2020 International Symposium on Wearable Computers, Virtual Event, pp. 1-6. https://doi.org/10.1145/3410531.3414312
  29. Fahy, C, and Yang, S (2019). Finding and tracking multi-density clusters in online dynamic data streams. IEEE Transactions on Big Data. 8, 178-192. https://doi.org/10.1109/TBDATA.2019.2922969

Hussein A. A. Al-Khamees received his B.S. degree in Computer Science from University of Babylon, Iraq, in 1999. He received his M.S. degree in Information Technology from the Turkish Aeronautical Association - Institute of Science and Technology, Turkey, in 2017. His research interests include data mining, data stream analysis, deep learning, and intelligent systems.

E-mail: hussein.alkhamees7@gmail.com


Nabeel Al-A’araji received his B.S. degree in Mathematics from Al-Mustansiryah University, Iraq, in 1976. He received his M.Sc. degree in Mathematics from the University of Baghdad, Iraq, in 1978 and received his Ph.D. degree in Mathematics, from University of Wales, Aberystwyth, UK, in 1988. He is currently a professor at the Department of Software, University of Babylon. His research interests include artificial intelligence, GIS machine learning, neural networks, deep learning, and data mining.

E-mail: nhkaghed@itnet.uobabylon.edu.iq


Eman S. Al-Shamery received her B.Sc., M.Sc., and Ph.D. degrees in Computer Science from the University of Babylon, Iraq, in 1998, 2001, and 2013, respectively. She is currently a professor at the Department of Software, University of Babylon. Her research interests include artificial intelligence, bioinformatics, machine learning, neural networks, deep learning, and data mining.

E-mail: emanalshamery@itnet.uobabylon.edu.iq


Keywords: Data stream clustering, Clusters number, Evolving mechanisms

1. Introduction

Artificial intelligence (AI) is a broad branch of computer science, and machine learning underpins many of its techniques, among which clustering is one of the most important [1]. Clustering aims to split a dataset into significant clusters such that the data samples within a cluster have a high degree of similarity with one another and a low degree of similarity with the data samples in other clusters [2].

Data stream analysis has become one of the most active and effective research fields in computer science due to the diverse challenges that it poses compared with traditional data analysis, including the indefinite size, high access, single scan, limited memory and processing time, and concept drift caused by the non-stationary condition of the environments where data originate [3]. Data stream mining links two essential computer fields: data mining tasks and data streams [4].

Many techniques for analyzing traditional data can also be implemented on data streams, with clustering being one of them. Data stream clustering methods can be categorized into five types, one of which is the density-based class. This category includes the evolving Cauchy (e-Cauchy) clustering algorithm, on which this study is based [5].

Evolving systems are naturally able to change the general structure of the model designed to describe the data stream by updating it after the arrival of each data sample. This is achieved through several mechanisms, such as adding, merging, and splitting clusters to reduce the large number of clusters generated [6].

Fuzzy systems, which are based on the concept of fuzzy logic, are extensively used in different fields. In addition, evolving fuzzy algorithms are considered an important type of evolving systems because of their ability to interact with the data provided [7].

Determining an optimal number of generated clusters remains an open challenge, and traditional solutions cannot be applied to all cases. Therefore, evolving mechanisms are proposed as a solution to set an appropriate number of data stream clusters [8].

Most clustering algorithms (including e-Cauchy) generate a large number of clusters. Thus, the research question addressed in this study is how to overcome the large number of clusters generated while applying clustering algorithms to data streams.

To overcome this problem, this study proposes an evolving model based on the e-Cauchy algorithm. The model adopts two evolving mechanisms: adding clusters, and splitting the generated clusters into high-quality and low-quality clusters, after which all data samples are re-assigned from the low-quality to the high-quality clusters. Moreover, the paper presents a new fuzzy method for distributing testing data.

To evaluate the proposed model, several streaming datasets were used, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results demonstrated the model’s efficiency in producing an optimal number of clusters for each dataset and showed that it achieves a higher silhouette coefficient than previous models, thus outperforming them.

2. Related Work

This section presents the most relevant studies related to the determination of an optimal number of generated clusters.

The authors in [9] discussed the limitations of the K-means algorithm, the first of which is the determination of the number of clusters. They presented a cluster number assisted K-means (CNAK) model that estimates this number based on bipartite matching and by adjusting the algorithm parameters. This model can generate different scores for the number of clusters (NoC), which ranges from 1 to NoCmax, by applying the Kuhn-Munkres algorithm to obtain bipartite matching. The authors selected a random sample, used it as a reference, mapped other centroid clusters and compared them.

The authors of [10] used one of the most effective validity indices, the entropy measure, as an indicator to determine the optimal number of clusters. Initialization is one of the most important steps in K-means algorithms; in this model, it is achieved by selecting the first data sample through entropy. Hence, the model is known as an entropy-based initialization model. In addition, the model aims to maximize a Shannon entropy-based objective function. The authors showed that their model is better than the K-means algorithm in terms of the number of clusters.

A density-peak-based clustering algorithm for automatically determining the number (DPADN) of clusters was proposed in [8]. This model consists of three steps. First, a density measure is designed by applying a Gaussian kernel function to compute the density for all dataset samples. Second, a precluster approach is constructed to find the center of each cluster. Finally, a method is proposed which automatically sets the center of the clusters. The evolving mechanisms of this model include searching the nearest two clusters and then merging them into one.

Data stream mining can be defined as uncovering interesting and useful patterns from a large amount of data in a way that makes these patterns understandable. This can be achieved by many techniques [11].

In recent years, clustering has become one of the most important and widely used data stream mining techniques. Data stream learning is divided into two main types: supervised and unsupervised learning. Clustering is an unsupervised learning method [12].

Different methods are used to group (cluster) all the samples in a given dataset into clusters, each containing many samples that have a high degree of similarity between each other but are not similar to the samples in other clusters [11].

There are five types of data stream clustering methods: partitioning, hierarchical, grid-based, density-based, and model-based methods [13].

These same five types are also applied to traditional data. The model herein proposed is based on the e-Cauchy algorithm, which is a density-based method.

Cluster validation comprises three major tasks: clustering evaluation, clustering stability, and clustering tendency. Each task has several aspects. Clustering stability assesses how sensitive the clustering result is to different algorithm parameters, such as the number of clusters [14].

Determining the number of clusters in a dataset is also regarded as a cluster validation problem. The methods applied to set the number of clusters can be classified into three groups: methods based on merging and splitting, traditional methods, and methods based on evolutionary computation. The first of these groups includes evolving methods that can be implemented on datasets to determine cluster numbers [14].

Traditional models usually fail when dealing with streaming data because of the challenges posed by this modern data type. Evolving systems represent an important and attractive solution for handling data streams [13]. One of the most important characteristics of evolving systems is their ability to change the general structure of the model designed to describe the data stream. Accordingly, the major task and key feature of any evolving system is associated with several mechanisms, such as adding, splitting, merging, and removing entities. When the evolving system is based on the clustering technique, the above mechanisms are implemented in terms of clusters, i.e., adding clusters, splitting clusters, etc. [5].

This ensures greater flexibility when the system situations change into non-stationary environments where the data evolve over time or when concept drift appears. In traditional models, the data distribution is assumed to be stationary and the structure of the models and their parameters do not change over time. In evolving systems, the data environment is non-stationary; therefore, both the structure and parameters of the models change over time in what is classified as a dynamic state [8].

In these cases, the model is more related to non-stationary environments that can generate a continuous stream of data whose distribution does not remain constant.

More specifically, some designers train their models offline (offline training), which is sometimes known as batch training. The model may initially perform well; however, its performance gradually deteriorates, especially as the incoming data evolve. Moreover, evolving systems can proactively forget data (especially after updating) to maintain memory efficiency, so that the data stream can be processed without problems [15].

Certainly, the analytics of these data must be performed in real time; therefore, online analytics may be implemented through evolving identification methods that allow the simultaneous adjustment of the structure and parameters [16].

Furthermore, when the system is time-variant, it is necessary to describe the various behaviors through the evolution of the model structure and to identify the parameters online [17].

Naturally, because there is a change in the general behavior of the data stream, the learner should evolve to keep up with the data change. The training step is the general idea behind an evolving system. This step comprises the following [18]:

  1) Splitting the input space via a clustering algorithm to learn the antecedent parameters.

  2) Tuning the parameters and structure to apply the evolution mechanisms.

  3) Implementing a learning technique to learn the consequent parameters.

3. Proposed Method

Several clustering algorithms for data streams have been developed in recent years. In this study, we employed the e-Cauchy algorithm. The main idea was to compute the density of each arriving data sample from the dataset to construct the initial clusters, which are typically numerous. Subsequently, the cluster splitting mechanism was applied. Figure 1 shows a block diagram of the proposed model comprising five main parts that are explained in detail in the following subsections.

3.1 Data Pre-processing

Normalization is an essential data pre-processing step for most types of problems. Normalization can be accomplished through many methods, including min-max, decimal scaling, z-score, median and MAD, double sigmoid, and tanh estimators [19]. In the proposed model, the min-max technique was implemented. Given a set of matching scores {M_s}, where s = 1, 2, ..., n, the normalized scores M_s' are computed as:

$$M_s' = \frac{M_s - \min}{\max - \min},$$

where min and max denote the minimum and maximum of the scores.

The dataset is then partitioned into training (70% of the dataset) and test data (30%).
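The pre-processing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the split is taken in arrival order (the text does not say how the 70/30 partition is drawn), and a constant feature is mapped to 0.0 to avoid a division by zero.

```python
def min_max_normalize(scores):
    """Scale a list of numbers to [0, 1] using min-max normalization."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumed convention: a constant feature carries no information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def split_stream(samples, train_fraction=0.7):
    """Split samples in arrival order (assumption): first part for training."""
    cut = round(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


scores = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(scores)            # [0.0, 0.25, 0.5, 1.0]
train_part, test_part = split_stream(list(range(10)))
```

For a multi-feature dataset, the same function would be applied to each feature column separately.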

3.2 Clustering Algorithm

The core step of the e-Cauchy clustering algorithm is the computation of the density of each training data sample. It is worth noting that this algorithm uses two predefined thresholds: θ (which is set to 0.1) and the density threshold (Denthr), which is specific to each dataset. Of the two, Denthr was found to have the greater effect on the result.

The algorithm proceeds as follows: When the first data sample arrives, the first cluster is constructed and its parameters are set. When an additional data sample arrives, the density value is computed and compared with Denthr; if the density value is less than Denthr, then a new cluster is constructed; otherwise, the current data sample is appended to an existing cluster and the parameters of the corresponding cluster are modified. These steps are repeated for all samples of the selected dataset to simultaneously build the initial clusters, thus performing the first cluster addition mechanism.

The pseudocode for the e-Cauchy algorithm is shown in Figure 2, where data stream D consists of {x1, x2, ..., xi}, SCj refers to the samples in cluster (j), CCj denotes the center of cluster (j), and NoC is the number of clusters.
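The cluster-adding pass described above can be illustrated with a simplified Python sketch. It is not the pseudocode of Figure 2: the density form `1 / (1 + d^2 / sigma2)`, the parameter `sigma2`, the incremental mean update, and all function names are assumptions made for illustration, and the second threshold θ is omitted. Because the density here is not normalized the way the paper's is, the threshold value in the example is not comparable to the Denthr values reported later.

```python
def cauchy_density(x, center, sigma2=1.0):
    """Cauchy-type density of x around a center (assumed form, not from the paper)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return 1.0 / (1.0 + sq_dist / sigma2)


def add_clusters(stream, den_thr):
    """Single pass: open a new cluster when the best density falls below den_thr."""
    clusters = []  # each cluster: {"center": [...], "samples": [...]}
    for x in stream:
        if not clusters:
            # The first sample always creates the first cluster.
            clusters.append({"center": list(x), "samples": [x]})
            continue
        densities = [cauchy_density(x, c["center"]) for c in clusters]
        best = max(range(len(clusters)), key=densities.__getitem__)
        if densities[best] < den_thr:
            clusters.append({"center": list(x), "samples": [x]})
        else:
            c = clusters[best]
            c["samples"].append(x)
            n = len(c["samples"])
            # Incremental mean update of the cluster center.
            c["center"] = [m + (xi - m) / n for m, xi in zip(c["center"], x)]
    return clusters


stream = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
clusters = add_clusters(stream, den_thr=0.5)
print(len(clusters))  # two clusters for this toy stream
```

A lower threshold makes the pass more tolerant and yields fewer initial clusters; a higher one produces many small clusters, which is the situation the splitting mechanism is meant to address.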

3.3 Splitting Mechanism

As mentioned above, the cluster splitting mechanism is applied to divide all the generated clusters (initial clusters from Algorithm 1 in Figure 2) into high- and low-quality clusters (HQC and LQC). By applying this mechanism, the model can evolve by re-assigning the samples in all LQCs to HQCs to produce the final HQCs.

Both the cluster adding and splitting mechanisms are applied while training the model (with the training data), whereas the test data samples are assigned only to the HQCs that result from the cluster splitting mechanism. Figure 3 illustrates the pseudocode for the splitting algorithm.
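The idea of splitting clusters into HQCs and LQCs and then re-assigning the LQC samples can be sketched as follows. The text does not state the quality criterion used in Figure 3, so this sketch assumes a hypothetical one (a cluster is high-quality if it holds at least `min_size` samples) and re-assigns each LQC sample to the HQC with the nearest center; both choices are illustrative only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split_and_reassign(clusters, min_size):
    """Keep clusters with >= min_size samples (hypothetical criterion) as HQCs
    and move every sample of the remaining LQCs to the nearest HQC."""
    hqc = [c for c in clusters if len(c["samples"]) >= min_size]
    lqc = [c for c in clusters if len(c["samples"]) < min_size]
    if not hqc:
        return clusters  # nothing qualifies; leave the structure unchanged
    # Copy the HQCs so the input list is not mutated.
    result = [{"center": list(c["center"]), "samples": list(c["samples"])} for c in hqc]
    for c in lqc:
        for x in c["samples"]:
            nearest = min(result, key=lambda h: sq_dist(x, h["center"]))
            nearest["samples"].append(x)
    # Recompute each center as the mean of its final samples.
    for h in result:
        n = len(h["samples"])
        h["center"] = [sum(s[d] for s in h["samples"]) / n for d in range(len(h["center"]))]
    return result


initial = [
    {"center": [0.1, 0.1], "samples": [(0.0, 0.0), (0.2, 0.0), (0.1, 0.3)]},
    {"center": [1.1, 1.1], "samples": [(1.0, 1.0), (1.2, 1.0), (1.1, 1.3)]},
    {"center": [0.9, 0.8], "samples": [(0.9, 0.8)]},  # small cluster, treated as LQC
]
final = split_and_reassign(initial, min_size=2)
print([len(c["samples"]) for c in final])
```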

3.4 Assignment of Test Data

In this model, the assignment of the test data samples requires three steps: computing the distance between the current test data sample and the first cluster, building a sub-cluster with a radius equal to the computed distance, and determining all test data samples that fall within the built sub-cluster. Subsequently, the distances of all data samples in the sub-cluster and their average are calculated.

As a result of these steps, a set of distances is available for each test data sample. In fact, the number of distances should be equal to the number of HQCs formed by applying Algorithm 2 in Figure 3.

In some cases, the number of distances is large, so evaluating them for every test sample requires additional calculation time (as in the case of the sensor data stream, which has 54 clusters; see Table 1). A fuzzy membership function is therefore useful here. In the proposed methodology, all computed distances are passed to a membership function, as described in line 5 of the pseudocode for the test data assignment algorithm shown in Figure 4.

Finally, the minimum distance is selected (as described in line 6 of Algorithm 3 in Figure 4) to assign the current sample and all samples within the specified radius (as shown in line 7). This step is performed to ensure that the current data sample is assigned to the nearest cluster.
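A stripped-down version of this assignment step might look as follows. The membership function is not given in the text, so a Cauchy-type membership `1 / (1 + d^2)` is assumed here; the sub-cluster (radius) step is left out, and the function name is hypothetical. With this membership, the cluster at minimum distance is also the one with the highest membership degree.

```python
import math


def assign_test_sample(x, centers):
    """Return the index of the nearest cluster and the membership degrees of x."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in centers
    ]
    # Assumed Cauchy-type membership: 1 at the center, decaying with distance.
    memberships = [1.0 / (1.0 + d * d) for d in distances]
    nearest = min(range(len(centers)), key=distances.__getitem__)
    return nearest, memberships


hqc_centers = [(0.0, 0.0), (1.0, 1.0)]
index, memberships = assign_test_sample((0.9, 0.8), hqc_centers)
print(index)  # the sample is closer to the second center
```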

3.5 Evaluation

The silhouette coefficient (SC) is an internal index that evaluates the quality of the clustering results. More specifically, SC indicates whether the data samples were correctly clustered and separated (cluster coherence and goodness). The SC for data sample x(i) is computed as [20]:

$$SC_{x(i)} = \frac{b_{x(i)} - a_{x(i)}}{\max\left(a_{x(i)},\, b_{x(i)}\right)},$$

where $a_{x(i)}$ is the average distance from x(i) to every data sample in the same cluster, and $b_{x(i)}$ is the lowest average distance between x(i) and every data sample in other clusters. It is worth pointing out that the SC for each stream dataset is measured before and after the evolving process for both the training and test data.
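As a worked illustration of the formula, the following Python sketch computes the SC of a single sample with Euclidean distance. The function names are hypothetical, and returning 0.0 for a singleton cluster or a single-cluster partition is a common convention assumed here rather than something stated in the paper.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(i, points, labels):
    """Silhouette coefficient of points[i] given one cluster label per point."""
    x, own_label = points[i], labels[i]
    groups = {}  # label -> distances from x to the members of that cluster
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            groups.setdefault(lab, []).append(euclidean(x, p))
    own = groups.pop(own_label, [])
    if not own or not groups:
        return 0.0  # assumed convention for degenerate cases
    a = sum(own) / len(own)                             # mean intra-cluster distance
    b = min(sum(d) / len(d) for d in groups.values())   # nearest other cluster
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
sc0 = silhouette(0, points, labels)
```

For the first point, a = 1 and b = (4 + sqrt(17)) / 2, so the coefficient is roughly 0.75, which indicates a well-separated assignment. The SC of a whole dataset is the mean of the per-sample values.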

Based on the confusion matrix, which provides good details about the classifier, we use accuracy and purity [21]. Accuracy is computed according to

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative.

Purity can be calculated by

$$\text{Purity} = \frac{1}{n}\sum_{i=1}^{n} \text{Precision},$$

where the precision is given by

$$\text{Precision} = \frac{TP}{TP + FP}.$$
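These three measures translate directly into code. The sketch below assumes that n in the purity formula is the number of classes and that precision is evaluated once per class; the function names and the toy counts are hypothetical and do not come from the paper's experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def purity(per_class_counts):
    """Mean precision over classes (assumption: n = number of classes).
    per_class_counts is a list of (TP, FP) pairs, one per class."""
    return sum(precision(tp, fp) for tp, fp in per_class_counts) / len(per_class_counts)


acc = accuracy(tp=40, tn=45, fp=5, fn=10)   # 85 correct out of 100
pur = purity([(40, 10), (30, 10)])          # mean of 0.8 and 0.75
```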

The proposed model was implemented in Python 3.6.9 on a Windows 10 Professional operating system, with a 2.5 GHz Core i7 CPU and 16 GB of RAM.

4. Dataset Description

To evaluate the proposed model, different streaming datasets were used, which were generated from diverse domains, such as questionnaires, user behavior recognition, and human activity recognition. Table 1 provides a brief overview of the datasets with a numerical data type [22,23].

5. Results and Discussion

As explained earlier, the e-Cauchy clustering algorithm contains two thresholds; the most important one is Denthr. However, the value of this threshold varies from one dataset to another. Therefore, a specific threshold value was set for each streaming dataset.

The first dataset used to test the proposed model was the power supply stream dataset, for which Denthr was set to 0.0161. Figure 5 shows the results for this dataset. The model initially generated 1,400 clusters; after implementing the evolving step, the number was reduced to 678 and finally to the optimal number of 24 clusters. The SC for the training data was −0.75 before evolving and 0.30 after evolving, whereas the SC for the test data before and after evolving was 0.31 and 0.33, respectively.

Considering the SC value of the testing step for the power supply dataset, which is 0.33, the proposed model outperforms the models presented in [24] and [25], which achieved an SC of 0.18 and 0.32, respectively.

The second dataset used for testing the model was the sensor data stream. For this dataset, Denthr was set to 0.0011. The initial number of clusters was 1,887, which was reduced to 1,052 and then to 412; finally, the optimal number of clusters was 54. In terms of SC, for the training step it was −0.52 before evolving and 0.22 afterwards, whereas for the test data it was 0.08 and 0.49 before and after evolving, respectively. Figure 6 shows the results for the sensor data stream.

Based on the final SC value of the testing step for this dataset (0.49), the proposed model achieved a higher SC than the model proposed in [24], which achieved an SC value of 0.30.

The HuGaDB01_01 sensor dataset was then used to test the proposed model, and Denthr was set to 0.0038. The number of clusters was initially 852, then 221, then 9, and finally 4. In terms of SC, its value for the training step was −0.68 before evolving and 0.39 after evolving. The SC for the testing step before evolving was −0.002, increasing to 0.52 after evolving. Figure 7 illustrates these results.

The fourth evaluation dataset was the UCI-HAR stream dataset, and its Denthr was set to 0.00028. The number of clusters was initially 3,605, which decreased to 933, then to 412, and finally to 6, which is the optimal NoC. The SC of the training data before evolution was −0.32, which increased to 0.41 after evolving, while the SC for the test data was 0.11 before evolution and 0.53 thereafter. Figure 8 shows the results for the UCI-HAR dataset.

According to the SC value of the UCI-HAR dataset, the proposed model outperforms the model presented in [26], which achieved an SC of 0.441 by implementing the K-means algorithm.

Next, the model was tested on the Luxembourg stream dataset, with a Denthr of 0.0026. The number of clusters at the beginning was 1,901, then 666, and finally 2, which was the optimal NoC. The SC of the training data before evolution was −0.44, which increased to 0.57 after evolving, whereas the SC for the test data was −0.08 before evolution and 0.57 afterwards. Figure 9 shows the results for this dataset.

The last dataset was the keystrokes stream dataset, for which Denthr was set to 0.055. Initially, the number of clusters was 592; the NoC then decreased to 39, to 8, and finally to 4. The SC of the training data before and after evolution was −0.36 and 0.29, respectively. The SC for the test data was 0.01 before evolution and increased to 0.57 afterwards. These results are shown in Figure 10.

The accuracy and purity for each stream dataset are listed in Table 2. Based on the results in Table 2, the proposed model outperforms several previous models. For the sensor data stream, this model achieved a purity of 81.25%, whereas the model in [27] had a purity of 76.5%. In the case of the UCI-HAR dataset, the accuracy of the proposed model was 77.39%, slightly below the 78.79% of the model in [28]. For the keystrokes dataset, the proposed model achieved an accuracy of 84.57%, whereas the model in [29] attained an accuracy of 77.0%.

In general, the proposed model is highly accurate in assigning the data samples to an appropriate cluster, although it is not suitable for processing high-dimensional stream datasets.

6. Conclusion

Data streams are a modern type of data that can be generated in real-world applications. Their main characteristics are their massive size, high speed, and concept drift. Many techniques can be used on data streams, including clustering, which aims to group similar data samples into different clusters. Fuzzy systems are widely used in computer science, particularly in the field of AI. Determining an optimal number of clusters remains an open problem for researchers, as there is no static method for this purpose. This paper presents a fuzzy model to overcome such a problem. The proposed fuzzy model depends on the e-Cauchy clustering algorithm, which is a density-based data stream clustering method and implements a specific fuzzy membership function. The model applies two evolving mechanisms: adding and splitting clusters. The evaluation step involves finding an optimal number of clusters, as well as computing the SC, accuracy, and purity. Six streaming datasets were used to evaluate this model, namely, power supply, sensor, HuGaDB, UCI-HAR, Luxembourg, and keystrokes. The results obtained from the proposed model were analyzed and compared with those of previous models, showing that this model is more efficient and has a better performance than other existing models.

Our future work will focus on applying the proposed model to online cloud computing and synchronizing it with login services to determine the authenticity and validity of users.

Figure 1. Block diagram of the proposed method.

Figure 2. Pseudocode for the e-Cauchy algorithm.

Figure 3. Pseudocode for the cluster splitting algorithm.

Figure 4. Algorithm for assigning test samples.

Figure 5. (a) NoC and (b) SC for the power supply dataset.

Figure 6. (a) NoC and (b) SC for the sensor data stream.

Figure 7. (a) NoC and (b) SC for the HuGaDB dataset.

Figure 8. (a) NoC and (b) SC for the UCI-HAR dataset.

Figure 9. (a) NoC and (b) SC for the Luxembourg dataset.

Figure 10. (a) NoC and (b) SC for the keystrokes data stream.

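Table 1. Characteristics of numerical streaming datasets.

| # | Dataset | Classes | Features | Samples |
|---|---------|---------|----------|---------|
| 1 | Power supply | 24 | 2 | 29,928 |
| 2 | Sensor | 54 | 5 | 2,219,803 |
| 3 | HuGaDB01_01 | 4 | 39 | 2,435 |
| 4 | UCI-HAR | 6 | 561 | 10,299 |
| 5 | Luxembourg | 2 | 30 | 1,901 |
| 6 | Keystrokes | 4 | 10 | 1,600 |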

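Table 2. Quality indices for the tested stream datasets.

| # | Dataset | Accuracy (%) | Purity (%) |
|---|---------|--------------|------------|
| 1 | Power supply | 72.30 | 71.88 |
| 2 | Sensor | 81.48 | 81.25 |
| 3 | HuGaDB01_01 | 89.75 | 89.12 |
| 4 | UCI-HAR | 77.39 | 77.00 |
| 5 | Luxembourg | 97.80 | 97.80 |
| 6 | Keystrokes | 84.57 | 84.93 |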

References

  1. Cioffi, R, Travaglioni, M, Piscitelli, G, Petrillo, A, and De Felice, F (2020). Artificial intelligence and machine learning applications in smart production: progress, trends, and directions. Sustainability. 12, article no. 492.
  2. Abdullatif, A, Masulli, F, and Rovetta, S (2018). Clustering of nonstationary data streams: a survey of fuzzy partitional methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8, article no. e1258.
  3. Al-Khamees, HA, and Al-Shamery, ES (2021). Survey: clustering techniques of data stream. Proceedings of 2021 1st Babylon International Conference on Information Technology and Science (BICITS), Babil, Iraq, pp. 113-119. https://doi.org/10.1109/BICITS51482.2021.9509923
  4. Al-Tarawneh, M (2021). Data stream classification algorithms for workload orchestration in vehicular edge computing: a comparative evaluation. International Journal of Fuzzy Logic and Intelligent Systems. 21, 101-122. https://doi.org/10.5391/IJFIS.2021.21.2.101
  5. Skrjanc, I, Ozawa, S, Ban, T, and Dovzan, D (2018). Large-scale cyber attacks monitoring using evolving Cauchy possibilistic clustering. Applied Soft Computing. 62, 592-601. https://doi.org/10.1016/j.asoc.2017.11.008
  6. Al-Khamees, HAA, Al-A’araji, N, and Al-Shamery, ES (2021). Data stream clustering using fuzzy-based evolving Cauchy algorithm. International Journal of Intelligent Engineering and Systems. 14, 348-358. https://doi.org/10.22266/ijies2021.1031.31
  7. Lughofer, E (2011). Evolving Fuzzy Systems: Methodologies, Advanced Concepts and Applications. Berlin: Springer. https://doi.org/10.1007/978-3-642-18087-3
  8. Tong, W, Liu, S, and Gao, XZ (2021). A density-peak-based clustering algorithm of automatically determining the number of clusters. Neurocomputing. 458, 655-666. https://doi.org/10.1016/j.neucom.2020.03.125
  9. Saha, J, and Mukherjee, J (2021). CNAK: cluster number assisted K-means. Pattern Recognition. 110, article no. 107625.
  10. Chowdhury, K, Chaudhuri, D, and Pal, AK (2021). An entropy-based initialization method of K-means clustering on the optimal number of clusters. Neural Computing and Applications. 33, 6965-6982. https://doi.org/10.1007/s00521-020-05471-9
  11. Al-Khamees, HAA, Al-Jwaid, WRH, and Al-Shamery, ES (2022). The impact of using convolutional neural networks in COVID-19 tasks: a survey. International Journal of Computing and Digital Systems. 11, 189-197. https://doi.org/10.12785/ijcds/110194
  12. Wicaksana, Wiharto AK, and Cahyani, DE (2021). Modification of a density-based spatial clustering algorithm for applications with noise for data reduction in intrusion detection systems. International Journal of Fuzzy Logic and Intelligent Systems. 21, 189-203. https://doi.org/10.5391/IJFIS.2021.21.2.189
  13. Mousavi, M, Bakar, AA, and Vakilian, M (2015). Data stream clustering algorithms: a review. Int J Adv Soft Comput Appl. 7, 1-15.
  14. Hancer, E, and Karaboga, D (2017). A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number. Swarm and Evolutionary Computation. 32, 49-67. https://doi.org/10.1016/j.swevo.2016.06.004
  15. Angelov, P, and Kordon, A (2010). Evolving inferential sensors in the chemical process industry. Evolving Intelligent Systems: Methodology and Applications. Hoboken, NJ: John Wiley & Sons, pp. 313-336. https://doi.org/10.1002/9780470569962.ch14
  16. Skrjanc, I, Andonovski, G, Ledezma, A, Sipele, O, Iglesias, JA, and Sanchis, A (2018). Evolving cloud-based system for the recognition of drivers’ actions. Expert Systems with Applications. 99, 231-238. https://doi.org/10.1016/j.eswa.2017.11.008
  17. Souza, V, dos Reis, DM, Maletzke, AG, and Batista, GE (2020). Challenges in benchmarking stream learning algorithms with real-world data. Data Mining and Knowledge Discovery. 34, 1805-1858. https://doi.org/10.1007/s10618-020-00698-5
  18. Baruah, RD, and Angelov, P (2011). Evolving fuzzy systems for data streams: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 1, 461-476. https://doi.org/10.1002/widm.42
  19. Jain, A, Nandakumar, K, and Ross, A (2005). Score normalization in multimodal biometric systems. Pattern Recognition. 38, 2270-2285. https://doi.org/10.1016/j.patcog.2005.01.012
  20. Shamim, G, and Rihan, M (2020). Multi-domain feature extraction for improved clustering of smart meter data. Technology and Economics of Smart Grids and Sustainable Energy. 5, article no. 10.
  21. Al-Khamees, HAA, Al-A’araji, N, and Al-Shamery, ES (2022). Classifying the human activities of sensor data using deep neural network. Intelligent Systems and Pattern Recognition. Cham, Switzerland: Springer, pp. 107-118. https://doi.org/10.1007/978-3-031-08277-1_9
  22. Chereshnev, R, and Kertesz-Farkas, A (2017). HuGaDB: human gait database for activity recognition from wearable inertial sensor networks. Analysis of Images, Social Networks and Texts. Cham, Switzerland: Springer, pp. 131-141. https://doi.org/10.1007/978-3-319-73013-4_12
  23. Anguita, D, Ghio, A, Oneto, L, Parra, X, and Reyes-Ortiz, JJ (2012). Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. Ambient Assisted Living and Home Care. Heidelberg, Germany: Springer, pp. 216-223. https://doi.org/10.1007/978-3-642-35395-6_30
  24. Supardi, NA, Abdulkadir, SJ, and Aziz, N (2020). An evolutionary stream clustering technique for outlier detection. Proceedings of 2020 International Conference on Computational Intelligence (ICCI), Bandar Seri Iskandar, Malaysia, pp. 299-304. https://doi.org/10.1109/ICCI51257.2020.9247832
  25. Tang, XF, Huang, R, Chen, Q, Peng, ZY, Wang, H, and Wang, BH (2019). An outlier detection method of low-voltage users based on weekly electricity consumption. IOP Conference Series: Materials Science and Engineering. 631, article no. 042004.
  26. Ariza-Colpas, PP, Vicario, E, Oviedo-Carrascal, AI, Butt Aziz, S, Pineres-Melo, MA, Quintero-Linero, A, and Patara, F (2022). Human activity recognition data analysis: history, evolutions, and new trends. Sensors. 22, article no. 3401.
  27. Shao, J, Tan, Y, Gao, L, Yang, Q, Plant, C, and Assent, I (2019). Synchronization-based clustering on evolving data stream. Information Sciences. 501, 573-587. https://doi.org/10.1016/j.ins.2018.09.035
  28. Abedin, A, Motlagh, F, Shi, Q, Rezatofighi, H, and Ranasinghe, D (2020). Towards deep clustering of human activities from wearables. Proceedings of the 2020 International Symposium on Wearable Computers, Virtual Event, pp. 1-6. https://doi.org/10.1145/3410531.3414312
  29. Fahy, C, and Yang, S (2019). Finding and tracking multi-density clusters in online dynamic data streams. IEEE Transactions on Big Data. 8, 178-192. https://doi.org/10.1109/TBDATA.2019.2922969