Various sequence data have grown explosively in recent years. As more and more of such data become available, clustering is needed to understand the structure of sequence data. However, the existing clustering algorithms for sequence data are computationally demanding. To avoid such a problem, a featurebased clustering algorithm has been proposed. Notwithstanding that, the algorithm uses only a subset of all possible frequent sequential patterns as features, which may result in the distortion of similarities between sequences in practice, especially when dealing with sequence data with a large number of distinct items such as customer transaction data. Developed in this article is a featurebased clustering algorithm using a complete set of frequent sequential patterns as features for sequences of sets of items as well as sequences of single items which consist of many distinct items. The proposed algorithm projects sequence data into feature space whose dimension consists of a complete set of frequent sequential patterns, and then, employs
The amount of sequence data has grown explosively in recent years, and is expected to grow more rapidly than ever. Understanding and clustering such sequence data are invaluable for most companies in identifying different groups of customers, and thereby, develop a marketing strategy tailored to each group. In the area of biological science, grouping protein sequences that share similar structures is important because such sequences would have similar functionality.
Commonly employed algorithms for clustering sequence data are based on sequence alignment methods [1–3], which have been mainly studied in the areas of bioinformatics and computational biology to deal with such biological data as DNA and protein sequences. Most of the existing sequence alignment methods are focused on developing effective similarity measures between sequences and efficient algorithms to calculate the measures. Then, with such computed similarity measures, any traditional hierarchical or partitional clustering algorithm can be employed. In these methods, calculation of the similarity or distance between two sequences can be reformulated as an optimal sequence alignment problem (see Section 2.1), which fits well in the framework of dynamic programming [3]. However, the quadratic computational complexity of pairwise comparison in the dynamic programming algorithm requires much computational time to obtain clustering results, which makes it impractical for most applications that require clustering of a large sequence dataset [4].
Other widely used algorithms for clustering sequence data are modelbased methods. In these methods, analytical or statistical models are constructed to describe the nature of each cluster of sequences, and for that purpose, suffix trees and Markov models [5–9] are frequently used. However, these methods are not without drawbacks. They also require much computational time in general since they are iteratively implemented [10].
To overcome the drawbacks of the existing approaches, a featurebased clustering approach has been proposed. For instance, Guralnik and Karypis [4] suggested a featurebased approach which finds a set of independent frequent sequential patterns called features, and projects the sequences into a new space whose dimensions consist of those features. Then, a vectorspace clustering algorithm based on Kmean is applied. To determine a set of features, they suggested a global as well as a local approach. In the global approach, two frequent sequential patterns are defined as dependent if one is subpattern of the other or the number of sequences which support both of them is greater than a predefined threshold. Otherwise, in a local approach, two frequent sequential patterns are regarded as dependent if they are partially overlapped in a particular sequence. With these criteria for determining the dependency between two frequent sequential patterns, they extracted a set of independent patterns from mined frequent sequential patterns to minimize the distortion of similarity.
The featurebased clustering approach has several advantages over the others. First, the high computational complexity of computing pairwise similarities (as in the dynamic programming approach) can be avoided, and therefore, a reduction in the computational time for clustering can be achieved especially for largescale sequence databases [3]. Second, the featurebased approach deals with the global information of the whole sequence data in the clustering process. This means that only those sequences that have frequently occurred items are clustered, while those sequences that have rarely occurred items are excluded from the clustering process even if they are much similar to each other. Consequently, it is more robust to outliers. However, the existing featurebased clustering algorithm by Guralnik and Karypis [4] uses only a subset of all possible frequent sequential patterns as features, which may result in the distortion of similarities between sequences in practice, especially when dealing with sequence data with a large number of distinct items such as customer transaction data. Moreover, it has some difficulties in dealing with sequences of sets of many distinct items. A sequence of sets of many distinct items differs from a sequence of single items in that the former consists of an ordered list of sets of many distinct items such as appeared in the sequence of customer transactions over time.
To overcome such drawbacks, a new featurebased clustering algorithm is developed for clustering of sequences with many distinct items. It is designed to deal with sequences of sets of items as well as sequences of single items. The proposed featurebased clustering algorithm regards each sequence as a binary vector in the feature space which has an orthogonal basis whose elements are the complete set of frequent sequential patterns, and employs the Kmeans clustering algorithm for clustering binary vectors. The proposed approach is similar to Guralnik and Karypis [4] in that it adopts the idea of projecting each sequence into a new space to treat it as a binary vector. However, in Guralnik and Karypis [4], only a subset of all possible frequent sequential patterns is used as features to be able to handle sequences with a few items (e.g., biological sequences) because these sequences may have too many possible frequent sequential patterns. However, using a subset of all possible sequential patterns may result in the distortion of similarity as can be seen in Figure 3 in Section 2.1. To deal with this problem, an additional investigation process of original sequence data to determine the independency among features is required in Guralnik and Karypis [4], while, in the proposed approach, such an additional investigation process is not required since the complete set of sequential patterns is used to minimize the distortion of similarity, and thereby, the computational time can be significantly reduced.
The rest of this paper is organized as follows: in Section 2, the proposed clustering algorithm is described in detail, and Section 3 presents the experimental and computational results. Finally, conclusions are provided in Section 4.
The proposed algorithm for clustering sequence data consists of: 1) mining the complete set of frequent sequential patterns and simultaneously representing the sequences as binary vectors; and 2) clustering binary vectors by employing the Kmeans clustering algorithm. In this section, clustering procedures are discussed in detail, and then these are illustrated with an example.
Sequential pattern mining is widely used in various areas for analyzing biological sequences, customer purchase behaviors, web usage patterns, intrusion detection, etc. [11, 12]. Frequent sequential patterns obtained from the mining process can explain the nature of sequence data, and therefore, can be used as a set of features for clustering [4, 13, 14]. Several methods have been developed for efficient mining of frequent sequential patterns (e.g., see [15–18]). Agrawal and Srikant [15] may be the first who dealt with the frequent sequential pattern mining problem. Given a large database of customer transactions, where each transaction consists of customerid, transaction time, and the items purchased by a customer, all the transactions of a specific customer can be represented as a sequence of sets of items. Given a userdefined minimum threshold, the problem of mining frequent sequential patterns is to find all subsequences that appear in the database more often than a userdefined minimum threshold.
Let E = {e_{1}, e_{2}, …, e_{T}} be a set of T distinct items in which e_{t} = t for t = 1, 2, …, T. Suppose that the database D of l distinct sequences is given. A sequence S_{i} for i = 1, 2, …, l is an ordered list of itemsets s_{ij} for j = 1, 2, …, m_{i}, and is denoted by [s_{ij}, j = 1, 2, …, m_{i}]. An itemset s_{ij} is a set of items, and is denoted by s_{ij} = (e_{ijk}, k = 1, 2, …, n_{ij}, e_{ijk} ∈ E). Without loss of generality, it is assumed that the elements in an itemset are sorted in ascending order of items. In a case of all n_{ij} = 1 for j = 1, 2, …, m_{i}, a sequence S_{i}, of course, can be regarded as a sequence of single items. For example, sequence S_{1} in Figure 1(a) has five itemsets: s_{11} = (1), s_{12} = (2), s_{13} = (3), s_{14} = (4) and s_{15} = (2, 2, 5, 6), where the elements of each itemset represent items. A sequence S_{i} = [s_{i}_{1}, s_{i}_{2}, …, s_{im}_{i}] is said to support a sequence S_{i}_{′} = [s_{i}_{′1}, s_{i}_{′2}, …, s_{i}_{′}_{m}_{i′}] if there exist positive integers c_{1} < c_{2} < ⋯ < c_{m}_{i′} such that s_{i}_{′1} ⊆ s_{ic}_{1}, s_{i}_{′2} ⊆ s_{ic}_{2}, …, s_{i}_{′}_{m}_{i′} ⊆ s_{ic}_{mi′}, and this relationship is denoted by S_{i} ⊃ S_{i}_{′}. The support value of a sequence S^{*} (e.g., sequential pattern) is defined as the percentage of the sequences that support S^{*}. That is,
where X represents the number of elements in a set X. In Figure 1, sequences S_{1} = [s_{11}, s_{12}, s_{13}, s_{14}, s_{15}] = [(1)(2)(3)(4) (2, 2, 5, 6)] and S_{3} = [s_{31}, s_{32}, s_{33}, s_{34}] = [(3)(7)(3, 4)(2)] support p_{10} = [p_{10,1}p_{10,2}] = [(3)(2)] because p_{10,1} ⊆ s_{13}, p_{10,2} ⊆ s_{15} and p_{10,1} ⊆ s_{31}, p_{10,2} ⊆ s_{34}, respectively. Therefore, the support value of p_{10} is 50%.
Given a userdefined minimum support threshold α, a set of all sequences whose support value is greater than or equal to α is called the complete set of frequent sequential patterns. The complete set of frequent sequential patterns implies that there are no constraints in the mining process except for the minimum support threshold. Obviously, the complete set includes all possible subsequences of a mined frequent sequential pattern. For example, as can be seen in Figure 1, if a sequential pattern p_{15} = [(3)(3, 4)] is obtained as one of frequent sequential patterns, all subsets p_{3} = [(3)], p_{4} = [(4)], p_{11} = [(3)(3)], p_{12} = [(3, 4)] are also obtained.
In the present study, the PrefixSpan [16] algorithm is employed to obtain the complete set of frequent sequential patterns which is used as a set of features for clustering sequence data. A binary matrix of original sequences can be simultaneously constructed with the mining process. The PrefixSpan algorithm mines frequent sequential patterns through examining only prefixes of sequence data and projecting their corresponding postfixes into smaller projected databases. Frequent sequential patterns can be obtained by recursively building and scanning the projected databases. The process is repeated until no frequent itemset is found in the projected database. This algorithm mines all possible frequent sequential patterns (i.e., the complete set of sequential patterns) and linearly scalable. For a detailed description of the algorithm, the reader is referred to [16].
Assume that the purchase records of four customers are given. In Figure 1(a), a total of 23 sequential patterns are obtained when the minimum support value is set to 50%, and they compose the complete set of frequent sequential patterns. Figure 1(b) shows a binary matrix that is constructed simultaneously with the pattern mining process.
Guralnik and Karypis [4] argued that the problem of similarity distortion is arisen because of high dependency and partial overlap between patterns. So, the original sequence data should be scanned to determine the independency or partial overlap among sequential patterns. And then, they insert additional constraints (e.g., the minimum length or the maximum length of sequential patterns) into the mining process to obtain an independent subset of the complete set of frequent sequential patterns. Then, original sequences are projected into a new space whose dimensions consist of selected sequential patterns by pruning one of dependent or partially overlapped sequential patterns. In Figure 1, sequential patterns from p_{6} to p_{23} are obtained under the additional constraint that the minimum length of patterns should be greater than 1. High dependency means that if a sequential pattern (e.g., p_{14} in Figure 1) is supported by a sequence (e.g., S_{3} in Figure 1), then a subpattern (e.g., p_{3} in Figure 1) of former is also supported by the latter. Those two sequential patterns (e.g., p_{14} and p_{3} in Figure 1) are called dependent. Partial overlap means that two patterns have duplicated region in a particular sequence. For example, in Figure 1, two sequential patterns p_{14} and p_{23} supported by a sequence S_{3} are defined as partially overlapped patterns due to the duplicated region (Item 7) in S_{3}.
However, the problem of similarity distortion is actually caused by using a subset of all possible sequential patterns (i.e., a subset of the complete set of frequent sequential patterns) as features rather than high dependency and partial overlap between patterns. In other words, the problem is mainly caused by the lack of information that sufficiently describes the whole sequential characteristic or nature. In Figure 1, Guralnik and Karypis [4] fail to utilize the sequential patterns from p_{1} to p_{5} due to the additional constraint even though they have sufficient information.
Especially for the case of sequences with many distinct items, the distortion of similarity can be minimized by using the complete set of frequent sequential patterns to construct a basis of the feature space. Thus, the above unnecessary data scan process can be avoided. Moreover, there is little difference between the computational times to obtain the subset (e.g., from p_{6} to p_{23} in Figure 1) and the complete set of frequent sequential patterns because most of sequential pattern mining algorithms are based on the Apriori property [19], in other words, shorter sequential patterns should be obtained first to get longer ones. Therefore, the whole computational time can be significantly reduced through eliminating unnecessary data scan process.
Before discussing the similarity distortion issue, the sequence alignment method for directly calculating the similarity between sequences is introduced. Let S_{1} and S_{2} be two sequences of m_{1} and m_{2} itemsets, respectively. First, the similarity between two itemsets s_{1}_{i} for i = 1, 2, …, m_{1} and s_{2}_{j} for j = 1, 2, …, m_{2} is defined as the ratio of the number of common items between the two itemsets to the average number of items of the two itemsets. Then, an alignment score between S_{1} and S_{2} can be calculated as the sum of similarities of itemsets for a possible alignment of S_{1} against S_{2}. The optimal alignment score is defined as the maximum alignment score between two sequences, and the similarity between the two sequences is determined as the ratio of the optimal alignment score to the average length (defined as the average number of itemsets) of the two sequences. For example, the similarity between S_{1} = [s_{11}, s_{12}] = [(1, 2, 3)(1, 2, 4, 5)] and S_{2} = [s_{21}] = [(1, 2)] can be calculated as follows. There are two possible alignments between S_{1} and S_{2} as depicted in Figure 2.
In the case of Figure 2(a), the alignment score is 0.8 because similarity between two itemsets s_{11} and s_{21} is 0.8. On the other hand, in the case of Figure 2(b), the alignment score is 0.67. Therefore, the optimal alignment score is 0.8, and then the similarity between S_{1} and S_{2} can be calculated as the optimal alignment score divided by the average number of itemsets of two sequences, i.e., 0.53.
In Figure 3, the distortion of similarity issue is illustrated with the previous example. The elements of matrices represent the similarity between sequences. Figure 3(b) shows the similarity matrix obtained by the sequence alignment method and provides the optimal sequence similarity score based on the dynamic programming. Figure 3(c) is the proposed method based on the complete set of frequent sequential patterns. And Figures 3(d), and 3(e) show similarity matrices of projected sequences (e.g., see Figure 1(b)) computed by ‘cosine’ similarity under the different constraints on the minimum length of sequential patterns. The cosine similarity can be defined as follows. Let X and Y are vectors of n dimensions. The cosine similarity between X and Y is
where ⟨X · Y⟩ represents an inner product between X and Y. For example, similarity between projected sequence data S_{1} and S_{3} in Figure 1(b) can be calculated as follows:
To measure the distortion of similarity in Figure 3, various matrix norms including 1norm, 2norm, infinity norm and Frobenius norm were used. Matrix norms are generally used to quantify the difference between two matrices. For a detailed description of the matrix norms, the reader is referred to Kim et al. [20, 21]. From the comparisons using the matrix norms (Table 1), we can find using a complete set of frequent sequential patterns as features (C in Table 1) has the smallest difference from the optimal alignment score based on the dynamic programming (B in Table 1) regardless of the choice of norm. That is, the proposed method causes less distortion of similarity than existing featurebased method using a subset of frequent sequential patterns (D and E in Table 1).
The binary matrix obtained from the mining process is used for clustering the original sequences. By treating original sequences as binary vectors, the computation time for calculating similarity between sequences can be significantly reduced compared to the sequence alignment method [3].
Rows of the binary matrix represent the original sequences in the feature space. The proposed method employs the Kmeans clustering algorithm with the cosine distance, commonly used for document clustering [22]. Detailed procedures are as follows:
Randomly select K rows of the binary matrix as initial cluster centroids.
Assign each point to its closest centroid by computing the cosine distance between the point and each cluster centroid.
Compute a new cluster centroid of each cluster, which is the mean vector of the points in that cluster after normalizing the length of each points to the unit Euclidean length.
Repeat Steps 2 and 3 until cluster centroids converge.
The time complexity of the Kmean clustering algorithm is O(lKI), where l is the number of cases (i.e., sequences in the present problem), K is the number of clusters, and I is the number of iterations of the algorithm to converge. Since l is usually much larger than both K and I, the time complexity becomes near linear to l [23].
In order to verify the effectiveness and accuracy of the proposed method (i.e., the featurebased approach plus Kmeans clustering), computational experiments were conducted using a real dataset and a synthetic dataset. There are some difficulties in directly comparing the proposed method with Guralnik and Karypis [4] because there exist some ambiguities in choosing parameters such as the minimum and the maximum length of sequential patterns or a predefined threshold in the global approach. Therefore, the proposed method was compared to: 1) the sequence alignment method combined with Kmedoid clustering; 2) the sequence alignment method combined with hierarchical clustering; and 3) the sequence alignment method combined with the density peaks clustering.
The Kmedoids clustering algorithm [24] is similar to Kmean clustering algorithm except for the step of calculating cluster centers. The Kmedoids clustering algorithm regards the cluster center of each cluster as one of the existing points that is located most centric, while in the Kmean clustering algorithm, the cluster center of each cluster is computed as the mean vector of those points in that cluster. It is impossible to compute the center of each cluster in sequence data, and therefore, the Kmedoids clustering algorithm in which cluster centers need not be computed is compared with the proposed method.
The hierarchical clustering algorithms can be classified into three categories according to how the distance between clusters is defined. They include singlelinkage, averagelinkage, and completelinkage methods [25]. Sometimes, the singlelinkage method has a tendency to group a large number of sequences into a single large cluster. This phenomenon is called the “chain effect”. The singlelinkage method was not considered since both datasets show a high “chain effect” when the method was used. Accordingly, the averagelinkage and completelinkage methods are compared with the proposed method.
The density peak clustering algorithm assumes that the cluster centers have higher densities than their neighbors and they are at a relatively large distance from points with higher densities [26]. This approach has an advantage that the cluster centers can be intuitively selected and the clusters are recognized easily regardless of their shape and of the dimensionality of data. Moreover, this algorithm can be conducted based only on the similarity matrix. For this comparison, the cluster centers are selected using γ computed by ρ (density) × δ (minimum distance between the sequence and any other sequence with higher density), ρ and δ are computed on the basis of the similarity between sequences. For a detailed description of the density clustering method, the reader is referred to Rodriguez and Laio [26].
There are two types of measures for validating clustering results. The external quality measure compares the discovered clusters to a priori clusters. On the other hand, the internal quality measure evaluates if the discovered clusters are inherently appropriate for the data without reference to external knowledge. Assuming no a priori knowledge on the clusters of the two experimental datasets, we adopt the internal quality measure in the present study.
The internal measure of cluster quality used in the present study is a weighted average similarity where the similarity between sequences is based on the optimal alignment score discussed in Section 2.1.
Let C = {C_{1}, C_{2}, …, C_{K}} be a set of clusters. The weighted average similarity is calculated as follows. A higher WAS value represents a better clustering solution.
Compute the average similarity (AS) for each cluster C_{k} for 1 ≤ k ≤ K.
where sim(S_{i}, S_{i}_{′}) is the similarity between sequences S_{i} and S_{i}_{′}.
Calculate the weighted average similarity (WAS) of clusters C as follows.
The retail market dataset used in the experiment was obtained from an ecommerce company. The dataset contains 57,686 transactions on 305 items by 5,001 customers during a sixmonth period in a certain year. In addition, 5,001 customers who made 3 or more transactions were considered. As a preprocessing step, we generated a sequence which is a list of transactions in ascending order of transaction time for each customer. Subsequently, 5,001 sequences of sets of items were generated, and therefore, this data can be categorized as “sequences of sets of items with many distinct items”.
To analyze the sensitivity of clustering results to the minimum support threshold, two different support threshold values, 1% and 3%, were considered. In this dataset, we found 657 and 110 frequent sequential patterns with the minimum support of 1% and 3%, respectively. Only the sequences that support at least one of the mined frequent sequential patterns were clustered to compare the five clustering algorithms on the same condition. Those sequences that support none of the patterns are treated as outliers. As a result, 4 original sequences were excluded with the minimum support of 1%, and 96 sequences were excluded with 3%. Then, the WASs were calculated when the number of clusters are 10, 20, and 30, respectively.
Figures 4(a) and 4(b) show the WASs for the five clustering algorithms. Experimental results show that the proposed clustering method performs better than the sequence alignment methods combined with other traditional and recent clustering algorithms regardless of the support value and the number of clusters. It is interesting to note that the WAS of the proposed featurebased clustering combined with Kmeans clustering is much higher even than that of the recent clustering algorithm based on optimal alignment score. This implies that clustering task is readily done in a vector space spanned by binary vectors obtained from the complete set of frequent sequential patterns and the proposed featurebased approach better reflects the nature of the global sequence by using the complete set of frequent sequential patterns.
The synthetic dataset T10I4D100K.dat [27] generated by IBM Almaden Quest research group was used in this study. This dataset contains 100,000 transactional data with 1,000 items. We randomly chose 5,000 transactional data to reduce the time for calculating the weighted average similarity. Actually, this dataset is not a sequential one, but just a transactional one. That is, the order of items is not important. Without loss of generality, we assumed that each sequence represents a sequence of items, and treated the dataset as “sequences of items with many distinct items”.
In a similar manner to the previous experiment, we considered two support threshold values of 0.5% and 1%, and found 1,406 and 419 frequent sequential patterns, respectively. Every sequence supported at least one of the patterns with the support of 0.5%, and 4 sequences do not support any sequential patterns with the support of 1%. As mentioned earlier, sequences that support none of the sequential patterns are treated as outliers, and thus excluded.
Figures 5(a) and 5(b) show the weighted average similarities for the five clustering algorithms. Experimental results show that the proposed clustering method performs better than the sequence alignment methods combined with other clustering algorithms regardless of the support value and the number of clusters. Compared to the previous experiment on sequences of a set of items, these results show that the quality of clusters which are obtained by the proposed method is much better than other approaches, implying that it can be used as a useful alternative when clustering sequences with many distinct items.
In this paper, a featurebased clustering algorithm for sequences with many distinct items is developed. Given two parameters, a minimum support threshold and the number of clusters, the proposed method finds meaningful clusters of sequence. It uses a complete set of frequent sequential patterns as features for clustering and employs Kmeans clustering with cosine distance to obtain meaningful clusters. Unlike the existing approaches which have difficulty in dealing with sequences of sets of items, the proposed method can deal with sequences of sets of items as well as sequences of items if there exist many distinct items. Moreover, it can be applied to large scale databases since it is linearly scalable.
To verify the effectiveness of the proposed method, it is compared to other traditional and recent clustering algorithms in terms of the weighted average similarity. Two datasets were used in the experiment. One is a retail market dataset which consists of sequences of sets of items with many distinct items, and the other is a synthetic dataset of sequences of single items with many distinct items. Experimental results show that a projection of sequences into a binary vector space helps to find better clusters, and thereby the proposed method outperforms the sequence alignment based clustering approaches regardless of the support threshold value and the number of clusters considered.
The above encouraging results require the proper value of minimum support threshold for sequential patterns. Therefore, it may be a fruitful area of future research to develop an efficient method for determining a proper support value. In addition, to reduce the computational time, a direct clustering method through merging two processes mining frequent sequential patterns and clustering projected binary vectors can be developed in further study.
This study was supported by the Research fund for a new professor by the Seoul National University of Science and Technology (SeoulTech).
No potential conflict of interest relevant to this article was reported.
Example of constructing the binary matrix. (a) Sequence data and the complete set of frequent sequential patterns with support of 50%. (b) Constructed binary matrix.
Example of two possible alignments: (a) first alignment and (b) second alignment.
Similarity distortion problem with constrained patterns. (a) Sequence data. (b) Similarity based on the optimal alignment score. (c) Similarity based on a complete set (constraints: support= 50%). (d) Similarity based on a subset of the complete set (constraints: support= 50% min length= 2). (e) Similarity based on a subset of the complete set (constraints: support= 50% min length= 3).
Experimental results for the retail market dataset with the support of 1% (a) and 3% (b).
Experimental results for the synthetic dataset with the support of 0.5% (a) and 1% (b).
Comparison of similarity matrices using matrix norms
Optimal alignment score 
Complete set 
Difference 
Subset length= 2 
Difference 
Subset length= 3 
Difference 


1Norm  2.780  2.730  2.490  0.290  2.200  0.580  
2Norm  2.649  2.484  2.242  0.408  1.999  0.650  
Infinity norm  2.780  2.730  2.490  0.290  2.200  0.580  
Frobenius norm  2.804  2.728  2.606  0.198  2.614  0.190 
Email: shwang@seoultech.ac.kr
Email: ftgog@mju.ac.kr