A Scalable Feature Based Clustering Algorithm for Sequences with Many Distinct Items
Int. J. Fuzzy Log. Intell. Syst. 2018;18(4):316-325
Published online December 25, 2018
© 2018 Korean Institute of Intelligent Systems.

Sangheum Hwang¹ and Dohyun Kim²

¹Department of Industrial & Information Systems Engineering, Seoul National University of Science and Technology, Seoul, Korea
²Department of Industrial and Management Engineering, Myongji University, Yongin, Korea
Correspondence to: Dohyun Kim (ftgog@mju.ac.kr)
Received November 14, 2018; Revised December 15, 2018; Accepted December 21, 2018.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
The volume of sequence data has grown explosively in recent years. As more of such data become available, clustering is needed to understand their structure. However, existing clustering algorithms for sequence data are computationally demanding. To address this problem, a feature-based clustering algorithm has been proposed. That algorithm, however, uses only a subset of all possible frequent sequential patterns as features, which may distort the similarities between sequences in practice, especially for sequence data with a large number of distinct items, such as customer transaction data. This article develops a feature-based clustering algorithm that uses the complete set of frequent sequential patterns as features, for sequences of sets of items as well as sequences of single items, where the sequences contain many distinct items. The proposed algorithm projects the sequence data into a feature space whose dimensions correspond to the complete set of frequent sequential patterns, and then applies the K-means clustering algorithm. Experimental results show that the proposed algorithm generates more meaningful clusters than the compared algorithms, regardless of the dataset and of parameters such as the minimum support value for frequent sequential patterns and the number of clusters. Moreover, the proposed algorithm can be applied to large sequence databases because it scales linearly with the number of sequences.
Keywords: Sequence data, Feature-based clustering, Frequent sequential patterns
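
The pipeline summarized in the abstract (mine frequent sequential patterns, project each sequence onto them, then run K-means) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. Several choices here are assumptions made only for illustration: sequences of single items only (no itemsets), a brute-force level-wise miner capped at `max_len`, binary containment features, and a deterministic farthest-first initialization for K-means. All function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def frequent_patterns(sequences, min_support, max_len=3):
    """Level-wise enumeration of all frequent sequential patterns up to max_len.

    Assumption: support = fraction of sequences containing the pattern.
    Every frequent pattern has a frequent prefix and a frequent last item,
    so extending frequent prefixes by frequent items misses nothing.
    """
    n = len(sequences)

    def support(p):
        return sum(is_subsequence(p, s) for s in sequences) / n

    items = sorted({x for s in sequences for x in s})
    level = [(i,) for i in items if support((i,)) >= min_support]
    frequent_items = [p[0] for p in level]
    result = list(level)
    for _ in range(max_len - 1):
        level = [p + (i,) for p in level for i in frequent_items
                 if support(p + (i,)) >= min_support]
        if not level:
            break
        result.extend(level)
    return result


def to_features(sequences, patterns):
    """Binary feature vector per sequence: 1.0 if it contains the pattern."""
    return [[1.0 if is_subsequence(p, s) else 0.0 for p in patterns]
            for s in sequences]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with farthest-first initialization (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Toy data: two groups of sequences over disjoint item sets.
    seqs = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"],
            ["x", "y", "z"], ["x", "y", "z"], ["x", "y"]]
    pats = frequent_patterns(seqs, min_support=0.3)
    print(len(pats), "patterns")
    print(kmeans(to_features(seqs, pats), k=2))
```

Each pattern's support is computed with one pass over the database, and each sequence is projected independently of the others, which is consistent with the abstract's claim of linear scaling in the number of sequences. The brute-force miner above is for exposition only and would not be practical for data with many distinct items.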