
Simultaneous Learning of Sentence Clustering and Class Prediction for Improved Document Classification

Minyoung Kim

Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, Korea
Correspondence to: Minyoung Kim, (mikim@seoultech.ac.kr)
Received February 13, 2017; Revised March 20, 2017; Accepted March 23, 2017.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

In document classification it is common to represent a document in the so-called bag-of-words form, which is essentially a global term distribution indicating how often certain terms appear in a text. Ignoring the spatial statistics (i.e., where in a text they appear) can potentially lead to a suboptimal solution. The key motivation or assumption in this paper is that there may exist an underlying segmentation of sentences in a document, and that this partitioning might be intuitively appealing (e.g., each group corresponds to a particular sentiment or gist of arguments). If the segmentation is known somehow, terms belonging to the same/different groups can potentially be treated in an equal/different manner for classification. Based on this idea, we build a novel document classification model comprised of two parts: a sentence tagger that predicts the group labels of sentences, and a classifier that forms the input features as a weighted term frequency vector that is aggregated from all sentences but weighted differently cluster-wise according to the prediction of the first model. We suggest an efficient learning strategy for this model. For several benchmark document classification problems, we demonstrate that the proposed approach yields significantly improved classification performance over several existing algorithms.

Keywords : Machine learning, Document classification, Sequence labeling, Term weighting
1. Introduction

Increased availability of document/text data from social networking services and Internet blogs or news articles demands efficient and accurate algorithms for automatic processing, organization, and indexing of text data. Document classification, the task of classifying a document into one of the pre-defined classes, is certainly the core problem for these higher-level goals. In the fields of natural language processing, machine learning, and data mining, the problem has received significant attention in recent years. The applications are also widespread, including among others: predicting users' reading preferences [1, 2], news article classification [3-5], categorizing web pages [6-8], and e-mail spam filtering [9-12], to name just a few.

In document classification it is common to form the so-called bag-of-words (BOW) representation, which is essentially a global term distribution, namely how often certain terms appear in a text. Also referred to as the term frequency (tf) vector, the BOW representation characterizes a document by the most frequent words that were used, which often carry the central theme/message of the document. One caveat, however, is that all terms are considered equally important (e.g., regardless of whether they are stop words or key nouns/verbs). The term weighting approaches remedy this issue by assigning different weights to terms according to their importance. The inverse document frequency (idf), the most popular term weighting scheme, weighs each term by the reciprocal of the document frequency (i.e., the number of training documents that contain the term), which effectively diminishes the contribution of stop words (e.g., the articles "a" or "the").

Later on, considerable research effort has gone into estimating more discriminative data-driven term weightings for better prediction performance. Most of these approaches exploit the class labels in the training data, specifically collecting information-theoretic statistics such as the χ2-metric, information gain, and odds ratio, to name a few [13, 14]. For instance, the recent tf-rf strategy [15] weighs each term according to a relevance score which, in binary classification, is defined as the ratio between the numbers of positive and negative training documents that contain the term. The intuition is that terms with high relevance scores are the most salient terms for deciding class labels.

Although these term weighting approaches have achieved considerable improvement, they are still based on the BOW representation for the entire document. That is, how often certain terms appear does matter, but where they appear is completely ignored. One key insight related to this issue is as follows: the same term behaves differently depending on where it appears (e.g., it has a stronger impact when used in a phrase with positive sentiment, and a much weaker one when found in a negative phrase, or vice versa). Consequently, ignoring the spatial statistics can potentially lead to a suboptimal solution. A document is a sequentially structured collection of sentences, and taking such spatial information into account is important for improving prediction performance.

The key motivation or assumption in this paper is that there may exist an underlying segmentation of sentences in a document. Although it is not required, this partitioning may be intuitively appealing (e.g., each group corresponds to a particular sentiment or gist of arguments). Here we choose sentences as the basic units for clustering, although the idea can be extended to paragraphs or other units. If the segmentation is known somehow, terms belonging to the same/different groups can potentially be treated in an equal/different manner for classification: for instance, we may assign identical/different weights (contributions) to the terms in the final feature representation.

Based on this motivating idea, we build a novel document classification model which is comprised of two parts. In the first part we aim to segment sentences into a pre-defined set of clusters. This can be done by building and learning a sentence tagger that predicts the group labels of sentences. Specifically we use the conditional random field (CRF) [16, 17], a probabilistic model, for its ability to capture the sequential statistical dependency of output (group label) variables, which can yield high predictive power. The sentence labels required to train the CRF model are typically unavailable in the training data, so we instead estimate them from the error signal evaluated in the second part of the model. The second part is a conventional classifier (e.g., a logistic regression model). However, it is specialized in that we form the input features as the weighted term frequencies that are aggregated from all sentences, but weighted differently cluster-wise according to the prediction of the first model.

We suggest an efficient learning strategy for this model, which also accounts for uncertainty in the group label prediction in an effective way. In particular, we learn the model in an alternating fashion: learn one component model while fixing the other part, and furthermore utilizing the predictions made by the latter. More specifically, we learn the first CRF model using the group labels of the sentences that minimize the classification loss in the second part, and at the same time we learn the classification model using the sentence segmentation information predicted by the CRF model. For several benchmark document classification problems, we demonstrate that the proposed approach yields significantly improved classification performance over several existing algorithms.

2. Problem Setup and Prior Approaches

In this section we formally setup the problem of document classification, and briefly review some related approaches that specifically utilize the term weighting paradigm.

As a training set we are given $n$ samples of document-class pairs, $\mathcal{D}=\{(D_i,c_i)\}_{i=1}^{n}$. We define the vocabulary as the set of all terms (or words) in the training corpus, $\mathcal{V}=\{\omega_k\}_{k=1}^{V}$, where $V$ denotes the size of the vocabulary. We deal with binary classification, thus $c \in \{+1,-1\}$ (multi-class problems can be handled straightforwardly by standard binarization tricks, e.g., the one-vs-others treatment in [18]). For the class label statistics, we let $p_k^1$ be the number of positive documents in the training set that contain the term $\omega_k$, and $n_k^0$ be the number of negative training documents in which $\omega_k$ is absent. Similarly, $p_k^0$ (positive documents in which $\omega_k$ is absent) and $n_k^1$ (negative documents that contain $\omega_k$) can be defined. The numbers of positive and negative documents can be derived as $n_+=p_k^1+p_k^0$ and $n_-=n_k^1+n_k^0$ for all $k$ (obviously, $n=n_++n_-$).

For learning a classifier, it is useful to represent a document as a feature vector. The tf representation is one of the most popular methods. Denoted by $\phi(D)=[tf_1, tf_2, \ldots, tf_V]$ with $tf_k$ being the frequency count of the term $\omega_k$ in $D$, the tf vector is a BOW representation that does not capture any spatial information such as where in $D$ the words appear. Motivated by the intuition that certain words are more important than others in classifying documents, several term weighting strategies have been proposed. The key idea is to weigh the entries of the tf vector differently. More specifically, the final feature representation $x(D)$ is formed by $x(D)=b\otimes\phi(D)$, where $b=[b_1, b_2, \ldots, b_V]\in\mathbb{R}^V$ is the term weighting vector, and the operator $\otimes$ indicates the element-wise product.
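As a minimal sketch of the weighted representation $x(D)=b\otimes\phi(D)$, the following Python snippet builds a tf vector over a fixed vocabulary and multiplies it element-wise by a weighting vector. The function names, the toy vocabulary, and the weight values are illustrative assumptions and do not come from the paper.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Return the term-frequency vector phi(D) over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def weighted_features(b, phi):
    """Element-wise product x(D) = b (*) phi(D)."""
    if len(b) != len(phi):
        raise ValueError("b and phi must have the same length")
    return [bk * tfk for bk, tfk in zip(b, phi)]

# Toy inputs (assumptions made for illustration only).
vocabulary = ["peace", "war", "the"]
tokens = "the war the peace the war".split()

phi = tf_vector(tokens, vocabulary)   # counts of "peace", "war", "the"
b = [2.0, 1.5, 0.0]                   # hand-picked weights, not learned
x = weighted_features(b, phi)
```

Setting the weight of "the" to zero mimics how a weighting scheme can suppress a stop word while leaving content words in place.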

For instance, the well-known idf scheme forms b as the reciprocals of the numbers of training documents that contain the terms, which has the effect of focusing more on unique terms and ignoring stop words. Some recent approaches further exploit the class labels in the training set to find a term weighting vector that is more discriminative and specialized to the particular classification task at hand. Most of these approaches typically utilize the information-theoretic statistics (e.g., information gain or χ2-statistics), essentially judging the importance of individual words by certain imbalance measures on the positive/negative class proportions. Below we list the five most popular term weighting schemes with formal definitions. Details or motivations of these weighting formulations can be found in the accompanying references.

1. idf: Inverse document frequency; $b_k=\log\left(\frac{n}{p_k^1+n_k^1}\right)$.

2. chi2: The χ2-statistics based term weighting [13, 14]; $b_k=\frac{n\,(p_k^1 n_k^0-p_k^0 n_k^1)^2}{n_+\,n_-\,(p_k^1+n_k^1)(p_k^0+n_k^0)}$.

3. ig: The information-gain term weighting [13, 14]; $b_k=\frac{p_k^1}{n}\log\left(\frac{(n/n_+)\,p_k^1}{p_k^1+n_k^1}\right)+\frac{p_k^0}{n}\log\left(\frac{(n/n_+)\,p_k^0}{p_k^0+n_k^0}\right)+\frac{n_k^1}{n}\log\left(\frac{(n/n_-)\,n_k^1}{p_k^1+n_k^1}\right)+\frac{n_k^0}{n}\log\left(\frac{(n/n_-)\,n_k^0}{p_k^0+n_k^0}\right)$.

4. rf: The relevance frequency term weighting [19]; $b_k=\log\left(2+\frac{p_k^1}{n_k^1}\right)$, which considers the ratio between the numbers of positive and negative documents that contain the term.

5. or: The odds-ratio term weighting [13, 14]; $b_k=\frac{p_k^1 n_k^0}{p_k^0 n_k^1}$, which is similar to rf, but also takes into account absence of the term.
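The five schemes above can be sketched in Python for a single term, given its four document counts. This is only an illustration under stated assumptions: the function name `term_weights` is hypothetical, natural logarithms are used, and all four counts are assumed to be strictly positive, since the text does not specify how zero counts are smoothed.

```python
import math

def term_weights(p1, p0, n1, n0):
    """Compute idf, chi2, ig, rf and or weights for one term.

    p1/p0: positive documents with/without the term.
    n1/n0: negative documents with/without the term.
    Assumes every count is > 0 (no smoothing is applied in this sketch).
    """
    if min(p1, p0, n1, n0) <= 0:
        raise ValueError("this sketch requires strictly positive counts")
    n_pos, n_neg = p1 + p0, n1 + n0
    n = n_pos + n_neg

    idf = math.log(n / (p1 + n1))
    chi2 = n * (p1 * n0 - p0 * n1) ** 2 / (
        n_pos * n_neg * (p1 + n1) * (p0 + n0))

    def ig_part(count, class_size, column_total):
        # One of the four summands of the information-gain formula.
        return (count / n) * math.log((n / class_size) * count / column_total)

    ig = (ig_part(p1, n_pos, p1 + n1) + ig_part(p0, n_pos, p0 + n0)
          + ig_part(n1, n_neg, p1 + n1) + ig_part(n0, n_neg, p0 + n0))
    rf = math.log(2 + p1 / n1)
    odds = (p1 * n0) / (p0 * n1)
    return {"idf": idf, "chi2": chi2, "ig": ig, "rf": rf, "or": odds}

# Toy counts (assumed): 40 positive and 60 negative documents.
weights = term_weights(p1=30, p0=10, n1=10, n0=50)
```

In the toy call the term occurs mostly in positive documents, so the class-aware scores (chi2, ig, rf, or) come out large, while idf depends only on how many documents contain the term.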

In practice the prediction performance of the above term weighting methods is often significantly improved over the non-weighted tf representation. However, it appears that no single scheme performs well consistently: a strategy that performs well in some specific cases may fail severely in others. Motivated by this, in our prior work [20] we proposed several methods to combine multiple different term weighting vectors that yield a robust classifier for different types of datasets. Although these approaches were successful in many situations, the main drawback is again that a single term weighting vector is constructed for the entire document in a pure BOW manner, failing to account for the spatial (sequential) statistics of the words in a document. In the next section, we introduce a novel document classification model that takes into account the sequential dependency of the sentences constituting the document.

3. Our Approach

We begin by illustrating an example that motivates our approach. Let us consider the sentiment classification task. Even though a document as a whole can be classified as a single sentiment (either positive or negative), some of the individual paragraphs or sentences may express an opposite sentiment or a different theme from the others. It is often the case that the same word is used to express a positive meaning in one place, but has a negative impact in another, depending on which type of paragraph/sentence it belongs to. For instance, the word "war" can have a negative effect when it appears in a sentence that contains phrases like "cause death" or "destroy buildings". However, if it is found in a paragraph/sentence whose main theme is war against negative things (e.g., drugs or terrorism), or if it is used solely for highlighting some positive meaning (e.g., we always pursue peace even in the war), then its negative contribution has to be reduced in the overall sentiment decision. In the former case, on the other hand, the negative impact should be emphasized.

Motivated by this, our main idea is to cluster the sentences in a document into several groups (say, there are $K$ different groups), and for each group $j\in\{1,\ldots,K\}$, we apply the same term weighting vector $b^{(j)}$ to the terms that belong to group $j$. We opt for the sentence as the minimal unit to which the same term weighting vector is applied, but it could also be a paragraph or another unit. To do so, one has to decide the group label of each sentence. This should be done in an unsupervised manner due to the high cost of human labeling. Also, unlike conventional clustering that treats samples (sentences in our case) independently of one another, we adopt a sequence prediction model (e.g., CRFs [16]) to take into account the sequential statistical dependency among sentences (e.g., two adjacent sentences are more likely to belong to the same group). The detailed model architecture and the learning strategy are described in the subsequent sections.

3.1 Model Architecture

For each training instance (document $D_i$ and its class label $c_i$), we let $T_i$ be the number of sentences comprising $D_i$ ($i=1,\ldots,n$). We often drop the sample index subscript $i$ (e.g., $D$ instead of $D_i$) for notational simplicity. For a given document $D=s_1,\ldots,s_T$, where $s_t$ indicates the $t$-th sentence in sequential order, we compute the term frequency vectors, one for each individual sentence, and denote them by $\Phi=\phi(s_1),\ldots,\phi(s_T)$. That is, $\Phi$ is a length-$T$ sequence of $V$-dimensional tf vectors of sentences. We introduce latent variables for the group labels that the sentences belong to, $Y=y_1,\ldots,y_T$, where $y_t$ is the group label of $s_t$. We denote the total number of sentence groups by $K$, so $y_t\in\{1,\ldots,K\}$.

Ultimately we aim to model the document-class distribution $P(c\mid D)$ for a given document $D=s_1,\ldots,s_T$. Using simple algebra, the model can be written as:

$P(c\mid D)=\sum_{Y}P(Y\mid D)\cdot P(c\mid Y,D). \qquad (1)$

In (1) the summation is over all possible sentence label sequences, which amounts to exponentially many ($K^T$) possible instances. Later we will discuss in detail how to circumvent this computational issue via a slice-wise independent approximation. The model (1) is comprised of two components: the first one, $P(Y\mid D)$, is a sentence label predictive model, while the second, $P(c\mid Y,D)$, is the document classification model that takes both the tf vectors and the labels of sentences as input covariates. In what follows we describe each model in greater detail.

The first model $P(Y\mid D)$ is essentially a sequence prediction model, and we adopt the popular conditional random fields (CRFs) [16]. CRF models are successful in diverse sequence prediction problems because they effectively model the sequential dependency among the samples in a sequence. The main idea of CRFs is to represent a conditional distribution as a log-linear model, where the linear feature function is designed to capture the input-output functional relationship as well as the sequentially structured dependency over the output variables. Specifically in our sentence label prediction model, we introduce linear weight parameter vectors $\theta_j\in\mathbb{R}^V$ ($j=1,\ldots,K$), where each $\theta_j$ is applied uniformly to all sentences belonging to group $j$. We also define the temporal dependency score by the parameter $\theta_{j,k}$ for two adjacent sentence labels $j$ and $k$ ($j,k=1,\ldots,K$). Then the model can be formally written as follows:

$P(Y\mid D)\propto\exp\left(\sum_{t=1}^{T}\sum_{j=1}^{K}I(y_t=j)\,\theta_j^\top\phi(s_t)+\sum_{t=2}^{T}\sum_{j,k=1}^{K}I(y_t=j,\,y_{t-1}=k)\,\theta_{j,k}\right). \qquad (2)$

We let $\theta$ denote all the parameters of the model, that is, $\theta=\left\{\{\theta_j\}_{j=1}^{K},\{\theta_{j,k}\}_{j,k=1}^{K}\right\}$.
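To make the log-linear form of the CRF concrete, here is a small Python sketch that evaluates the unnormalized log-score of a label sequence and, for tiny inputs only, normalizes it by brute-force enumeration of all $K^T$ sequences. The function names are hypothetical, labels are 0-based rather than 1-based, and `trans[j][k]` is assumed to hold the score for $y_t=j$, $y_{t-1}=k$.

```python
import itertools
import math

def crf_log_score(labels, phis, theta, trans):
    """Unnormalized log-score of a label sequence under the CRF.

    labels: group labels in 0..K-1 (0-based in this sketch).
    phis:   per-sentence tf vectors phi(s_t).
    theta:  theta[j] is the weight vector of group j.
    trans:  trans[j][k] is the score for y_t = j, y_{t-1} = k.
    """
    score = 0.0
    for t, (y, phi) in enumerate(zip(labels, phis)):
        score += sum(w * f for w, f in zip(theta[y], phi))
        if t > 0:
            score += trans[y][labels[t - 1]]
    return score

def crf_probability(labels, phis, theta, trans):
    """P(Y|D) by enumerating all K**T sequences (toy sizes only)."""
    K, T = len(theta), len(phis)
    log_scores = [crf_log_score(seq, phis, theta, trans)
                  for seq in itertools.product(range(K), repeat=T)]
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return math.exp(crf_log_score(labels, phis, theta, trans) - log_z)

# Toy parameters (assumed): V = 2 terms, K = 2 groups, T = 3 sentences.
phis = [[1, 0], [0, 2], [1, 1]]
theta = [[0.5, -0.2], [-0.1, 0.3]]
trans = [[0.4, -0.4], [-0.4, 0.4]]   # favors staying in the same group
prob = crf_probability([0, 1, 1], phis, theta, trans)
```

The enumeration is exponential in the number of sentences, which is exactly why practical inference relies on dynamic programming instead.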

The second model $P(c\mid Y,D)$ decides the class label of the document, and we consider the sentence labels $Y$ discovered in the first part as input covariates. Specifically we form a feature vector $\psi(Y,D)$ as:

$\psi(Y,D)=\sum_{t=1}^{T}\sum_{j=1}^{K}I(y_t=j)\,b^{(j)}\otimes\phi(s_t), \qquad (3)$

where $\{b^{(j)}\}_{j=1}^{K}$ are the term weighting vectors. In (3) each $b^{(j)}$ is applied to all sentences that are labeled as $j$, and is thus considered the term weighting vector responsible for sentence group $j$. Although it is possible to learn $\{b^{(j)}\}$ entirely from data, this may lead to overfitting. Rather we use the fixed term weighting vectors described in Section 2 (i.e., the idf, chi2, ig, rf, and or vectors). We then have $K=5$. In this way we urge the model to partition a document into several groups, where a specific term weighting vector is applied to each group. With the feature vector in (3) as an input, we define the classification model as a simple logistic classifier:

$P(c=+1\mid Y,D)=\frac{\exp(w^\top\psi(Y,D))}{1+\exp(w^\top\psi(Y,D))}, \qquad (4)$

where $w\in\mathbb{R}^V$ is the parameter vector of the model.
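The group-dependent feature vector and the logistic classifier can be sketched as follows. This is a minimal illustration with hypothetical function names and 0-based group labels; the weighting vectors in the toy call are made up rather than computed from any of the five schemes.

```python
import math

def psi_hard(labels, phis, b):
    """Feature vector psi(Y, D): each sentence's tf vector is weighted
    by the term weighting vector of its (hard) group label."""
    V = len(phis[0])
    psi = [0.0] * V
    for y, phi in zip(labels, phis):
        for k in range(V):
            psi[k] += b[y][k] * phi[k]
    return psi

def prob_positive(w, psi):
    """Logistic model P(c = +1 | Y, D), computed in a stable way."""
    z = sum(wk * pk for wk, pk in zip(w, psi))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Toy example (assumed values): two sentences, two terms, two groups.
phis = [[1, 2], [3, 0]]
b = [[1.0, 1.0], [2.0, 0.0]]
psi = psi_hard([0, 1], phis, b)        # first sentence in group 0, second in group 1
p = prob_positive([0.1, -0.3], psi)
```

Changing the label of a sentence changes which weighting vector scales its term counts, so the same words can contribute differently to the class decision.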

3.2 Learning Strategy

Although it is ideal to directly optimize (1) with respect to the model parameters $\theta$ and $w$, this becomes computationally infeasible because the number of terms to be considered is exponential in the number of sentences in a document. Rather we propose a learning strategy similar to expectation maximization (EM) [21], but with the parameter optimization alternating between the two models. Note that each partial optimization has to deal with the difficult expectation over all possible $Y$'s (e.g., the objective function for optimizing the second model (over $w$) is $\mathbb{E}[P(c\mid Y,D)]$, where the expectation is with respect to the CRF predictive distribution $P(Y\mid D)$). We approximate the objective by finding the most plausible instance $Y$ and fixing it during the parameter learning (in fact we use a soft label assignment for each $y_t$ instead of a hard assignment, as discussed shortly). The other optimization is done similarly by replacing the expectation with the most plausible point estimate. The details are described below.

First, we fix the CRF parameters $\theta$ in $P(Y\mid D)$ and optimize the logistic classifier over $w$. Instead of using the most plausible label sequence $Y$ (i.e., a point estimate), we further exploit the prediction uncertainty by adopting the soft label assignment probabilities $P(y_t\mid D)$ (a $K$-dimensional vector) for each $t$. The sentence-wise label distribution $P(y_t\mid D)$ can be found easily by the forward-backward inference algorithm in the CRF model. The technical details can be found in [16]. Once we evaluate $P(y_t\mid D)$ for all $t$, the feature vector $\psi(Y,D)$ in (3) can be extended to the label-averaged one, denoted by $\psi(D)$ (with no dependency on $Y$ now):

$\psi(D)=\sum_{t=1}^{T}\sum_{j=1}^{K}P(y_t=j\mid D)\,b^{(j)}\otimes\phi(s_t). \qquad (5)$

Unlike using a single point estimate $Y$ in (3), all plausible labels (i.e., those with high $P(y_t=j\mid D)$) are effectively accounted for in (5).
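The sentence-wise marginals and the label-averaged feature vector can be sketched in Python as below. This is a generic log-space forward-backward pass for a linear-chain model with the score structure of (2), not the author's implementation; the function names are hypothetical, labels are 0-based, and `trans[j][k]` is assumed to be the score for $y_t=j$, $y_{t-1}=k$.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def crf_marginals(phis, theta, trans):
    """Return P(y_t = j | D) for all t, j via forward-backward."""
    T, K = len(phis), len(theta)
    # node[t][j] = theta_j^T phi(s_t)
    node = [[sum(w * f for w, f in zip(theta[j], phi)) for j in range(K)]
            for phi in phis]
    alpha = [node[0][:]]
    for t in range(1, T):
        alpha.append([
            node[t][j] + logsumexp([alpha[t - 1][k] + trans[j][k]
                                    for k in range(K)])
            for j in range(K)])
    beta = [[0.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            logsumexp([trans[j][k] + node[t + 1][j] + beta[t + 1][j]
                       for j in range(K)])
            for k in range(K)]
    log_z = logsumexp(alpha[-1])
    return [[math.exp(alpha[t][j] + beta[t][j] - log_z) for j in range(K)]
            for t in range(T)]

def psi_soft(marginals, phis, b):
    """Label-averaged feature vector psi(D)."""
    V = len(phis[0])
    psi = [0.0] * V
    for probs, phi in zip(marginals, phis):
        for j, p in enumerate(probs):
            for k in range(V):
                psi[k] += p * b[j][k] * phi[k]
    return psi

# Toy parameters (assumed): V = 2 terms, K = 2 groups, T = 3 sentences.
phis = [[1, 0], [0, 2], [1, 1]]
theta = [[0.5, -0.2], [-0.1, 0.3]]
trans = [[0.4, -0.4], [-0.4, 0.4]]
marginals = crf_marginals(phis, theta, trans)
psi = psi_soft(marginals, phis, [[1.0, 1.0], [2.0, 0.0]])
```

The pass costs on the order of $T K^2$ operations, in contrast with the $K^T$ sequences that a direct summation would visit.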

Then using (5) as input feature vectors, we can learn $w$ by standard logistic regression training. This is maximum likelihood estimation over all the training instances. Also, it is typical to include the regularization term $\|w\|_2^2$ in the optimization for better generalization performance. More formally, we solve:

$\max_{w}\sum_{i=1}^{n}\left(c_i w^\top\psi(D_i)-\log\left(1+e^{c_i w^\top\psi(D_i)}\right)\right)-\gamma\|w\|_2^2, \qquad (6)$

where $\gamma\ge 0$ is the constant that trades off the model regularizer against the log-likelihood.
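As an illustration of this step, the sketch below fits $w$ by plain gradient ascent on the log-likelihood, with the squared-norm term treated as a penalty that is subtracted so that the maximization is well posed. The optimizer, step size, and function names are assumptions made for the example; the paper does not state which solver was used.

```python
import math

def log_sigmoid(z):
    """Stable log(sigmoid(z)) = z - log(1 + exp(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def sigmoid(z):
    return math.exp(log_sigmoid(z))

def objective(w, feats, labels, gamma):
    """Penalized log-likelihood for labels c_i in {+1, -1}."""
    ll = sum(log_sigmoid(c * sum(wk * pk for wk, pk in zip(w, psi)))
             for psi, c in zip(feats, labels))
    return ll - gamma * sum(wk * wk for wk in w)

def fit_logistic(feats, labels, gamma=0.1, lr=0.1, steps=500):
    """Gradient ascent on the penalized log-likelihood."""
    w = [0.0] * len(feats[0])
    for _ in range(steps):
        grad = [-2.0 * gamma * wk for wk in w]
        for psi, c in zip(feats, labels):
            z = c * sum(wk * pk for wk, pk in zip(w, psi))
            coeff = c * (1.0 - sigmoid(z))   # derivative of log sigmoid(c * s)
            for k in range(len(w)):
                grad[k] += coeff * psi[k]
        w = [wk + lr * gk for wk, gk in zip(w, grad)]
    return w

# Toy label-averaged features psi(D_i) and class labels (assumed values).
feats = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
labels = [1, 1, -1, -1]
w = fit_logistic(feats, labels)
```

Because the penalized objective is concave in $w$, any ascent method with a small enough step size approaches the same maximizer.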

Next, with $w$ fixed in $P(c\mid Y,D)$, we learn the CRF parameters $\theta$. In the original model (1), one has to marginalize out $Y$ with respect to the unnormalized distribution $P(c\mid Y,D)$. Unfortunately, this is computationally infeasible, and another possibility is to approximate it by the single instance $Y$ that maximizes $P(c\mid Y,D)$. However, finding the single best instance also leads to a highly difficult combinatorial optimization. Instead, we aim to estimate the sentence-wise label distributions $P(y_t\mid D)$ in (5). That is, we regard as optimization variables the soft assignment probability vectors $\xi_t=[P(y_t=1),\ldots,P(y_t=K)]$ for $t=1,\ldots,T$, and maximize the log-likelihood of the logistic regression model. More specifically, for each training sample $i$, we solve:

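$\max_{\{\xi_t\}_{t=1}^{T},\,\psi}\; c_i w^\top\psi-\log\left(1+\exp(c_i w^\top\psi)\right) \qquad (7)$

$\text{s.t.}\;\; \psi=\sum_{t=1}^{T}\sum_{j=1}^{K}\xi_t(j)\,b^{(j)}\otimes\phi(s_t) \qquad (8)$

$\xi_t\ge 0,\;\; \sum_{j=1}^{K}\xi_t(j)=1,\;\; t=1,\ldots,T. \qquad (9)$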

Here (8) indicates that the feature vector $\psi$ for the logistic regression is parameterized by the soft assignment vectors $\xi_t$. Also, (9) enforces each $\xi_t$ to be a probability distribution. In this way we relax the problem into a continuous optimization that can be solved easily by a simple projected gradient ascent method: for each $t$, (i) update $\xi_t\leftarrow\xi_t+\eta\nabla_{\xi_t}h$, where $h$ is the objective function in (7) and $\eta$ ($>0$) is the learning rate, and (ii) project $\xi_t$ onto the probability simplex by zeroing out all negative entries of $\xi_t$ and dividing it by the sum of the entries. The maximizing vectors $\xi_t$ are then used in the CRF learning. Since CRF training [16] requires training samples of paired sequences $(Y,D)$, we sample the label sequence $Y$ from the $\xi_t$ optimized in the previous step. The CRF training is then done straightforwardly with the sampled label sequences.
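The two-step update can be sketched in Python as follows. The clip-and-renormalize step follows the description in the text; the uniform fallback for an all-zero vector, the step size, the iteration count, and all function names are assumptions added for this example. Labels are 0-based.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def project_to_simplex(v):
    """Zero out negative entries, then divide by the sum of the entries."""
    clipped = [max(x, 0.0) for x in v]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(v)] * len(v)   # fallback (assumption, not in the text)
    return [x / total for x in clipped]

def optimize_soft_labels(phis, b, w, c, lr=0.5, steps=100):
    """Projected gradient ascent on the soft assignments xi_t."""
    T, K = len(phis), len(b)
    # g[t][j] = w^T (b^(j) (*) phi(s_t)), so that w^T psi = sum_t sum_j xi_t(j) g[t][j]
    g = [[sum(wk * bk * fk for wk, bk, fk in zip(w, b[j], phi))
          for j in range(K)] for phi in phis]
    xi = [[1.0 / K] * K for _ in range(T)]
    for _ in range(steps):
        z = c * sum(xi[t][j] * g[t][j] for t in range(T) for j in range(K))
        coeff = c * (1.0 - sigmoid(z))   # dh / d(w^T psi)
        xi = [project_to_simplex([xi[t][j] + lr * coeff * g[t][j]
                                  for j in range(K)])
              for t in range(T)]
    return xi

def sample_labels(xi, rng):
    """Draw one label per sentence from its soft assignment vector."""
    return [rng.choices(range(len(row)), weights=row)[0] for row in xi]

# Toy setup (assumed values): group 0 weights terms more heavily than group 1.
phis = [[1.0, 2.0], [2.0, 1.0]]
b = [[1.0, 1.0], [0.2, 0.2]]
w = [0.5, 0.5]
xi_pos = optimize_soft_labels(phis, b, w, c=+1)
xi_neg = optimize_soft_labels(phis, b, w, c=-1)
sampled = sample_labels(xi_neg, random.Random(0))
```

In this toy setup a positive document pushes mass toward the group that amplifies the positively weighted terms, while a negative document pushes mass toward the group that dampens them.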

Overall, we alternate the two steps discussed so far (i.e., learning w with θ fixed, and learning θ with w fixed) until convergence.

4. Evaluation

In this section the prediction performance of the proposed approach is tested on several document classification datasets. The experimental setups and datasets are very similar to those in our prior work [20]; we briefly describe them below. Our approach is contrasted with various term weighting methods (discussed in Section 2), including the strategies for combining these base term weighting vectors proposed in our prior work [20]. Specifically we compare our approach with two previous weight mixing approaches: (i) loss minimization with the convex hull constraint (denoted by mix-ch) and (ii) the mini-max robust classifier formulation (referred to as mini-max). The features learned by these methods are fed into the SVM classifier for test prediction. Our approach of simultaneously learning the sentence label tagger and the document class predictor is denoted by lab-class.

We consider three popular benchmark datasets in text classification: Reuters-21578, WebKB, and 20 Newsgroups. The Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) consists of documents that appeared on the Reuters newswire in 1987. For the R8 dataset, comprised of documents from 8 classes including crude, earn, ship, and so on, we collect about 7,000 documents with a vocabulary size of approximately 16,000. The WebKB dataset (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) contains approximately 4,000 web pages from various universities in the US. The task is to classify documents into one of four classes: student, faculty, course, and project. The number of unique terms in the dataset is about 8,000. The 20 Newsgroups dataset (http://people.csail.mit.edu/jrennie/20Newsgroups/) is a relatively large dataset comprised of about 20,000 newsgroup documents classified into one of 20 different subject categories. Some of the class categories overlap considerably in taxonomy, making accurate prediction difficult. The corpus consists of about 70,000 terms.

We apply the conventional pre-processing for the data, mainly the word stemming that replaces all grammatical variations of words with root words. For this purpose we use the famous Porter’s Stemmer algorithm [22]. For each of these datasets, we randomly split the data into 50%/50% training/test sets, repeated for 10 random runs to report the average performance. As mentioned earlier, we transformed the multi-class problems into multiple binary classification tasks by the one-vs-others treatment.
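The evaluation protocol above (one-vs-others binarization and repeated random 50%/50% splits) can be sketched with the Python standard library. The function names and the use of a seeded shuffle are assumptions for illustration; stemming is omitted here.

```python
import random

def one_vs_others(labels, positive_class):
    """Turn multi-class labels into +1/-1 labels for one binary problem."""
    return [+1 if y == positive_class else -1 for y in labels]

def split_half(n_items, seed):
    """Random 50%/50% split of item indices into training and test sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    half = n_items // 2
    return indices[:half], indices[half:]

# Toy class labels (assumed), using class names mentioned for the R8 data.
labels = ["earn", "crude", "earn", "ship"]
binary = one_vs_others(labels, "earn")

# Ten random runs, one seed per run.
splits = [split_half(10, seed) for seed in range(10)]
```

Each class in turn plays the role of the positive class, which yields one binary problem per class and typically makes the class imbalance more pronounced.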

Since the class distributions in the datasets are considerably imbalanced (even more severely so due to the one-vs-others binarization), we employ the F-score as a performance measure for the competing classification methods. Although it is well-known in the information retrieval literature, here we briefly describe how the F-scores can be computed. The F-score combines both the precision and the recall of the classifier. The precision ($p$) is the proportion of the true positives (TP) out of the predicted positives (PP), that is, $p=\frac{TP}{PP}$, while the recall ($r$) is the ratio of the true positives out of all positive instances, $r=\frac{TP}{WP}$, where WP is the number of positive documents in the test set. The F-score is then defined as $\frac{2pr}{p+r}$.

Due to the binarization of the original multi-class problems, we need a way to average the multiple F-scores from the individual problems. One reasonable approach introduced in [19], also adopted in this paper, is twofold: (i) the micro-averaged F-score, defined as $\frac{2PR}{P+R}$, where $P=\frac{\sum_c TP(c)}{\sum_c PP(c)}$ and $R=\frac{\sum_c TP(c)}{\sum_c WP(c)}$ are based on the pooled counts (here TP(c), PP(c), and WP(c) are from the c-th binary problem), and (ii) the macro-averaged F-score, defined as $\frac{2P'R'}{P'+R'}$, where $P'=(1/K)\sum_c p(c)$ and $R'=(1/K)\sum_c r(c)$ are the averaged precision and recall (here p(c) and r(c) are the precision and the recall for the c-th problem, and K denotes the number of binary problems).
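The two averaging schemes can be sketched in Python as below, taking one (TP, PP, WP) triple per binary problem. The function names are hypothetical, and returning zero when a denominator is zero is an assumption of this sketch rather than something specified in the text.

```python
def safe_div(a, b):
    """Return a / b, or 0.0 when b is zero (convention assumed here)."""
    return a / b if b else 0.0

def f_score(p, r):
    return safe_div(2.0 * p * r, p + r)

def micro_macro_f(counts):
    """counts: list of (TP, PP, WP) triples, one per binary problem."""
    tp = sum(c[0] for c in counts)
    pp = sum(c[1] for c in counts)
    wp = sum(c[2] for c in counts)
    micro = f_score(safe_div(tp, pp), safe_div(tp, wp))

    n_problems = len(counts)
    mean_p = sum(safe_div(c[0], c[1]) for c in counts) / n_problems
    mean_r = sum(safe_div(c[0], c[2]) for c in counts) / n_problems
    macro = f_score(mean_p, mean_r)
    return micro, macro

# Toy counts (assumed) for two binary problems.
micro, macro = micro_macro_f([(8, 10, 10), (1, 5, 10)])
```

Micro-averaging pools the counts and is therefore dominated by the larger problems, whereas macro-averaging gives every binary problem the same weight.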

We report the prediction results of the competing approaches in Table 1. The F-scores, averaged over 10 random repetitions, are listed for the best performing single term-weighting approach (among the idf, chi2, ig, rf, and or strategies) and for the two mixed term-weighting methods (mix-ch and mini-max) from [20], alongside our approach (lab-class). The results indicate that the proposed approach performs the best among the competing term weighting strategies and mixing algorithms, consistently for all datasets. This can be attributed to the fact that our approach effectively takes into account the spatial term/word information via the two-stage modeling: sentence-wise segmentation for tagging theme/sentiment group labels for sentences, and the classification strategy that applies different term weightings according to the estimated theme groups.

5. Conclusion

In this paper we have addressed the drawback of the conventional document classification methods that are based on the BOW representation. Our approach effectively deals with the spatial statistics of words in a document by incorporating a sentence segmentation model. To account for the sequential dependency of the sentence theme labels, we have utilized the CRF structured tagging model, while the document classifier model exploits the predicted sentence labels in its prediction by applying label-wise different term weightings. Each of these two components in our model is learned from each other by the principle: best (soft) labels are optimized in one component, then exploited in learning the other. The proposed alternating learning algorithm turns out to be effective in accurate classification. Empirically, the proposed approach yields statistically significant performance improvement over existing term weighting strategies. While we have used fixed base term weighting vectors in the model training, it is also interesting to see the model behavior when the base vectors are also learned simultaneously. This is left as our future work.

Acknowledgements

This study was supported by Seoul National University of Science & Technology.

Conflict of Interest

TABLES

Table 1

F-scores

| Methods | Reuters-21578 Micro F-score | Reuters-21578 Macro F-score | WebKB Micro F-score | WebKB Macro F-score | 20-Newsgroup Micro F-score | 20-Newsgroup Macro F-score |
|---|---|---|---|---|---|---|
| Single TW | 0.8453 ± 0.0073 | 0.8460 ± 0.0071 | 0.8984 ± 0.0033 | 0.8990 ± 0.0070 | 0.5958 ± 0.0058 | 0.5961 ± 0.0053 |
| mix-ch | 0.9132 ± 0.0050 | 0.9090 ± 0.0077 | 0.9012 ± 0.0035 | 0.9046 ± 0.0011 | 0.6613 ± 0.0005 | 0.6595 ± 0.0025 |
| mini-max | 0.9087 ± 0.0076 | 0.9136 ± 0.0052 | 0.9190 ± 0.0040 | 0.9219 ± 0.0046 | 0.6742 ± 0.0004 | 0.6622 ± 0.0009 |
| lab-class | 0.9257 ± 0.0064 | 0.9268 ± 0.0071 | 0.9258 ± 0.0037 | 0.9289 ± 0.0036 | 0.6967 ± 0.0005 | 0.6875 ± 0.0013 |

Single TW indicates the best performing single term-weighting strategy among five methods in Section 2. The proposed approach is denoted as lab-class.

References
1. Domingos, PM, and Pazzani, MJ 1996. Beyond independence: conditions for the optimality of the simple Bayesian classifier., Proceedings of 1996 International Conference on Machine Learning, Bari, Italy, pp.105-112.
2. Lang, K 1995. Newsweeder: learning to filter netnews., Proceedings of 1995 International Conference on Machine Learning, Tahoe City, CA, pp.331-339.
3. Joachims, T 1998. Text categorization with support vector machines: learning with many relevant features., Proceedings of 10th European Conference on Machine Learning, Chemnitz, Germany, pp.137-142.
4. Li, JR, and Yang, K 2010. News clustering system based on text mining., Proceedings of 2010 IEEE International Conference on Advanced Management Science, Chengdu, China.
5. Chy, AN, Seddiqui, MH, and Das, S 2013. Bangla news classification using naive Bayes classifier., Proceedings of 16th International Conference on Computer and Information Technology, Khulna, Bangladesh.
6. Craven, M, DiPasquo, D, Freitag, D, McCallum, A, Mitchell, T, Nigam, K, and Slattery, S 1998. Learning to extract symbolic knowledge from the World Wide Web., Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI.
7. Shavlik, J, and Eliassi-Rad, T 1998. Intelligent agents for Web-based tasks: an advice-taking approach., Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI.
8. Youquan, H, Jianfang, X, and Cheng, X 2011. An improved naive Bayesian algorithm for web page text classification., Proceedings of 8th International Conference on Fuzzy Systems and Knowledge Discovery, Shanghai, China, pp.1765-1768.
9. Lewis, DD, and Knowles, KA (1997). Threading electronic mail: a preliminary study. Information Processing and Management. 33, 209-217.
10. Sahami, M, Dumais, S, Heckerman, D, and Horvitz, E 1998. A Bayesian approach to filtering junk E-mail., Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI.
11. Liu, WY, Wang, L, and Wang, T 2010. Online supervised learning from multi-field documents for email spam filtering., Proceedings of 2010 International Conference on Machine Learning and Cybernetics, Qingdao, China.
12. Pandey, U, and Chakravarty, S 2010. A survey on text classification techniques for e-mail filtering., Proceedings of 2010 Second International Conference on Machine Learning and Computing, Bangalore, India.
13. Debole, F, and Sebastiani, F 2003. Supervised term weighting for automated text categorization., Proceedings of the 2003 ACM Symposium on Applied Computing, Melbourne, FL.
14. Deng, ZH, Tang, SW, Yang, DQ, Li, MZLY, and Xie, KQ (2004). A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications 2004: Lecture Notes in Computer Science, Yu, JX, ed. Berlin: Springer, pp. 588-597
15. Lan, M, Tan, CL, and Low, HB 2006. Proposing a new term weighting scheme for text categorization., Proceedings of the 21st National Conference on Artificial Intelligence, Menlo Park, CA, pp.763-768.
16. Lafferty, JD, McCallum, A, and Pereira, FCN 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data., Proceedings of 2001 International Conference on Machine Learning, Williamstown, MA, pp.282-289.
17. Sha, F, and Pereira, F 2003. Shallow parsing with conditional random fields., Proceedings of 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton, Canada, pp.134-141.
18. Crammer, K, and Singer, Y (2001). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research. 2, 265-292.
19. Lan, M, Tan, CL, Su, J, and Lu, Y (2009). Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 31, 721-735.
20. Kim, M (2016). Robust algorithms for combining multiple term weighting vectors for document classification. International Journal of Fuzzy Logic and Intelligent Systems. 16, 81-86.
21. Dempster, AP, Laird, NM, and Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B. 39, 1-38.
22. Porter, MF (1980). An algorithm for suffix stripping. Program. 14, 130-137.
Biography

Minyoung Kim received his BS and MS degrees both in Computer Science and Engineering from Seoul National University, South Korea. He earned a PhD degree in Computer Science from Rutgers University in 2008. From 2009 to 2010 he was a postdoctoral researcher at the Robotics Institute of Carnegie Mellon University. He is currently an associate professor in Department of Electronics and IT Media Engineering at Seoul National University of Science and Technology in Korea. His primary research interest is machine learning and computer vision. His research focus includes graphical models, motion estimation/tracking, discriminative models/learning, kernel methods, and dimensionality reduction.

E-mail: mikim@seoultech.ac.kr
