

Original Article

Int. J. Fuzzy Log. Intell. Syst. 2016; 16(4): 293-298

Published online December 25, 2016. https://doi.org/10.5391/IJFIS.2016.16.4.293

Copyright © The Korean Institute of Intelligent Systems.

Document Summarization via Convex-Concave Programming

Minyoung Kim

Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, Korea

Correspondence to: Minyoung Kim (mikim@seoultech.ac.kr)

Received: November 21, 2016; Revised: December 12, 2016; Accepted: December 13, 2016

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Document summarization is an important task in various areas where the goal is to select a few of the most descriptive sentences from a given document as a succinct summary. Even without training data of human-labeled summaries, several interesting approaches in the literature yield reasonable performance. In this paper, within the same unsupervised learning setup, we propose a more principled learning framework for the document summarization task. Specifically, we formulate an optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. We circumvent the difficult integer program originating from binary sentence selection via continuous relaxation and a low-entropy penalty. We also suggest an efficient convex-concave optimization algorithm that is guaranteed to improve the original objective at every iteration. On several document datasets, we demonstrate that the proposed learning algorithm significantly outperforms existing approaches.

Keywords: Document summarization, Natural language processing, Text mining, Optimization

1. Introduction

Document summarization is the task of automatically generating a succinct summary or gist from a document or multiple related documents. While it has a relatively long (more than 50 years) history of research in natural language processing (NLP), text mining, and related fields, document summarization has recently received unprecedented attention due to the enormous amounts of text, news, blog, and web page data that need to be processed efficiently. For instance, in today's search engines such as Google, the top-ranked web pages relevant to a user query are shown with their titles, links, and informative summaries (called snippets) of the web pages in the most descriptive 20 to 30 words. Thus, more accurate and efficient document summarization algorithms are in high demand.

As there are several variants of document summarization, we first clarify the exact problem definition that we deal with in this paper. First, there are two different summarization tasks depending on the output form: an extractive summary selects sentences from a document (i.e., the summary consists of copies of sentences from the document), while an abstractive summary aims to produce a rephrased summary, conveying the key idea of the document in words different from those in the document. The latter task is of course more challenging and a long-term goal, and in this paper we focus on the extractive summary task. Also, the summarization task can be either query-based or general. The former aims to generate a summary that is relevant to specific user queries, while the latter outputs a gist of a document in a general sense. In this paper we deal with the general extractive summary task.

Moreover, from the learning/optimization point of view, the summarization task can be categorized as either supervised or unsupervised, based on whether one is given offline (training) data of document-summary pairs. Although collecting human-supervised data (manual summaries) is very expensive, current research still finds it difficult to exploit such supervised data effectively: several proposed supervised summarization algorithms often underperform unsupervised methods that do not require manually supervised summary data. For the very reason that collecting human summary supervision is expensive, in this paper we deal with the unsupervised learning setup.

Perhaps the pioneering work in (unsupervised) document summarization is that of Luhn [1], which introduced the so-called significance factor of a sentence. The significance factor is typically derived from the number of occurrences of significant words in the sentence, where word significance is defined by weighted word occurrence scores or other statistical indexes [2]. The top-ranked sentences in terms of the significance factor are then selected as summary sentences. The approach is quite simple, yet so effective that it is comparable even to the current state of the art.

More sophisticated summarization approaches have since been proposed. Enumerating all the related work is impossible, so here we list a few important previous approaches. Probabilistic approaches generally aim to represent the statistical process of generating words in a sentence, where the distributions are differentiated depending on the importance of the sentence. While the naive Bayes model [3] assumes word-wise independence for computational simplicity, this restriction is relaxed to yield more flexible models by incorporating sequential dependency (i.e., word ordering in sentences): hidden Markov models [4] or conditional log-linear models [5]. Discriminative approaches such as neural network training [6] often result in more accurate summarization due to the rich representational capacity of possibly deep network architectures. However, these approaches are mostly based on supervised learning, and thus suffer from the cost of collecting large amounts of labeled data (i.e., summaries manually written by humans). More recently, submodular optimization techniques have been proposed [7] that greedily find the most descriptive sentences.

In this paper we propose a fairly simple but very efficient algorithm for document summarization. We formulate an optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. The difficulty of the integer program originating from the binary selection variables is circumvented by real-valued relaxation and a low-entropy penalty. The relaxed/approximated problem becomes an instance of tractable convex-concave optimization, and we provide an efficient solution method that iteratively upper-bounds and solves the original problem. Tested on real-world document data, the proposed approach is shown to produce more plausible summaries than existing/baseline methods.

The rest of the paper is organized as follows. After introducing formal notation and discussing some baseline document summarization methods in Section 2, our main convex-concave formulation of the problem is described in Section 3. The empirical results of the proposed method are provided in Section 4, followed by concluding remarks.

2. Baselines for Document Summarization

In this section we introduce the notation used throughout the paper and describe several baseline unsupervised approaches that can yield reasonable document summarization results.

A document is comprised of words from a vocabulary set of size V. In text mining and natural language processing, it is typical to use a vector representation for a document (or sentence), and the most popular one is the so-called term-frequency (tf) vector that counts the frequency of each word occurring in the document (or sentence). Formally, the tf vector is denoted by [tf_1, tf_2, …, tf_V], a V-dimensional vector where tf_k is the number of times the term t_k occurs in the document (or sentence).

While the tf representation captures salient features of a document/sentence, it cares only about the frequencies of words, treating every word as equally important. This is counter-intuitive in that certain stop words (e.g., the articles a or the) usually appear the most frequently and would thus be considered the most significant. To avoid this drawback, one discounts the importance of such stop words by multiplying by the so-called inverse document frequency (idf) vector [idf_1, idf_2, …, idf_V] ∈ ℝ^V, where idf_k = log(n / df_k) and df_k is the number of training documents that contain the word t_k among a set of n documents. This results in the tf-idf vector that has tf_k · idf_k as its k-th entry. In the tf-idf representation, relatively unique terms (low df_k) are emphasized, while ubiquitous terms (high df_k) such as stop words are effectively ignored.
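
As a concrete illustration of these definitions, the following minimal Python sketch builds tf and tf-idf vectors for a toy corpus. The whitespace tokenization, the toy documents, and all variable names are illustrative assumptions rather than the preprocessing used in the paper.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of lowercased tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

vocab = sorted({w for d in docs for w in d})           # vocabulary of size V
n = len(docs)

# Document frequency df_k: number of documents containing term t_k.
df = {t: sum(1 for d in docs if t in d) for t in vocab}
idf = {t: math.log(n / df[t]) for t in vocab}          # idf_k = log(n / df_k)

def tf_vector(tokens):
    """Term-frequency vector [tf_1, ..., tf_V] for one document or sentence."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def tfidf_vector(tokens):
    """tf-idf vector with k-th entry tf_k * idf_k."""
    return [c * idf[t] for c, t in zip(tf_vector(tokens), vocab)]

print(tfidf_vector(docs[0]))
```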

Now we discuss several reasonable baseline approaches for document summarization. Although these approaches look simple, they are often quite successful in producing plausible summaries of a document. In the extractive summary task, the goal is to select (at most) b sentences from a document that best describe the whole document. The first method simply selects the first b sentences of the document. The rationale is that authors/writers often tend to put their main themes or ideas at the beginning of a document. This approach is denoted by First-b.

The second approach is the significance factor method introduced in [1]. Specifically, we evaluate the tf-idf vector of the whole document to assign an importance score to each word. We then evaluate the significance score of each sentence as the sum of the importance scores of its constituent words, and select the b highest-scoring sentences as a summary. We denote this classical but quite successful approach by Luhn58.
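
A minimal sketch of this Luhn58-style baseline is given below. For self-containment the idf statistics are computed over the sentences themselves rather than over a separate document collection, and the toy sentences and names are illustrative assumptions.

```python
import math
from collections import Counter

sentences = [
    "the economy grew strongly this year".split(),
    "we propose new investments in education".split(),
    "the weather was pleasant".split(),
    "education and the economy remain our top priorities".split(),
]
b = 2  # summary length (number of sentences)

# Word importance from the whole document's tf-idf (idf over sentences here for simplicity).
all_words = [w for s in sentences for w in s]
tf = Counter(all_words)
df = {w: sum(1 for s in sentences if w in s) for w in tf}
word_score = {w: tf[w] * math.log(len(sentences) / df[w]) for w in tf}

# Significance score of a sentence = sum of its words' importance scores.
sig = [sum(word_score[w] for w in s) for s in sentences]

# Luhn58: take the b highest-scoring sentences (First-b would just take sentences[:b]).
summary_idx = sorted(range(len(sentences)), key=lambda i: sig[i], reverse=True)[:b]
print(sorted(summary_idx))
```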

For the third baseline, one can posit a fairly reasonable probabilistic (e.g., Gaussian) density model for the sentence generation process. More formally, for a document comprised of m sentences, let a_i be the feature vector (e.g., tf or tf-idf) of sentence i (i = 1, …, m). One can then consider an underlying Gaussian density model P(a) = N(a; μ, Σ) from which the sentence feature vectors a_i are generated independently. Under this modeling assumption, the model parameters μ and Σ can be identified by maximum likelihood estimation (to avoid overfitting, the Gaussian covariance might be restricted to be diagonal or isotropic), which simply yields the sample mean and covariance, i.e., μ = (1/m) ∑_{i=1}^{m} a_i and Σ = (1/m′) ∑_{i=1}^{m} (a_i − μ)(a_i − μ)^⊤, where m′ can be either m (biased) or m − 1 (unbiased). Once the model is estimated, the summary is formed by selecting the b sentences with the highest likelihood scores P(a_i) among i = 1, …, m. This approach is denoted by Gaussian.
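
The Gaussian baseline can be sketched in a few lines of numpy, here with random features standing in for real tf-idf vectors and a diagonal covariance to avoid overfitting; the sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, V = 12, 50
A = rng.random((V, m))            # columns a_i: tf or tf-idf features of the m sentences
b = 3

mu = A.mean(axis=1)                               # maximum-likelihood mean
var = A.var(axis=1) + 1e-8                        # diagonal covariance (restricted to avoid overfitting)

# Log-likelihood of each sentence under the diagonal Gaussian N(mu, diag(var)).
diff = A - mu[:, None]
loglik = -0.5 * (np.log(2 * np.pi * var)[:, None] + diff**2 / var[:, None]).sum(axis=0)

summary_idx = np.argsort(-loglik)[:b]             # the b most "typical" sentences
print(sorted(summary_idx.tolist()))
```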

3. Our Approach

In this section we present our document summarization formulation. For a document comprised of m sentences, we assign a binary variable x_i to each sentence i (i = 1, …, m), where x_i = 1 (or 0) indicates that sentence i is selected (or not) as part of the summary. The selection vector x is thus an m-dimensional binary vector that we must choose. The summary of the document is then represented as a feature vector contingent on x, denoted by Φ(x). If we use the term-frequency representation for each sentence, for instance, Φ(x) is the sum of the tf vectors of the sentences selected for the summary (i.e., those with x_i = 1). Naturally, selecting all sentences as the summary, that is, taking the summary feature Φ(e) where e is the m-dimensional vector of all ones, captures the whole content of the document. Of course, we usually have a length limit for a summary, say we are allowed to choose at most b sentences. The goal is then to make the summary feature as close as possible to the full-document feature Φ(e). This can be formulated as the following optimization:

$$\min_{x}\ \|\Phi(e)-\Phi(x)\|^2 \quad \text{s.t.}\ \ e^\top x \le b,\ \ x\in\{0,1\}^m.\tag{1}$$

In (1), the constraint e^⊤x ≤ b encodes the summary length limit discussed above. It is also worth noting that while we employ the number of sentences as the summary length limit, one can incorporate a more general budget constraint. Defining c as an m-dimensional cost vector where c_i is the cost of selecting sentence i, and letting b be a budget limit, the constraint c^⊤x ≤ b can be quite expressive. For instance, by taking c_i to be the number of words in sentence i and b the word-count limit for a summary, the constraint c^⊤x ≤ b restricts the number of words in the summary. Although in (1) we use c = e to constrain the number of sentences (b) in a summary, our subsequent derivations apply straightforwardly to the general setting with little modification.

However, problem (1), which can be reduced to the famous knapsack problem, is NP-hard. We therefore propose a series of relaxations and approximations to obtain a tractable optimization problem, followed by an efficient solution method. First, the integer-valued x is relaxed to be real-valued in the interval [0, 1]. That is, instead of an all-or-nothing hard selection of sentences, we perform a kind of soft selection: x_i close to 1 (0) means sentence i is more (less) likely to be selected. However, one needs to force the selection variables to have strong confidence in the selection decision, namely to be close to 0 or 1 rather than around 0.5, which would incur ambiguity in the final sentence selection stage. For this purpose we add a regularization term to the objective that penalizes large entropy of each x_i. More specifically, our relaxed approximate optimization is formulated as follows:

$$\min_{x}\ \|\Phi(e)-\Phi(x)\|^2 + \lambda\sum_{i=1}^{m} H(x_i) \quad \text{s.t.}\ \ e^\top x \le b,\ \ 0\le x\le e,\tag{2}$$

where the inequalities 0 ≤ x ≤ e in the constraints are elementwise. In the objective, H(x_i) = −x_i log(x_i) − (1−x_i) log(1−x_i) is the entropy of the selection confidence for sentence i, which takes a small value for x_i close to either 0 or 1, and a large value for x_i around 0.5. The two terms in the objective are balanced by the trade-off parameter λ (≥ 0).

To be more concrete, note that the first term in the objective is convex quadratic in x. If we use the tf vector representation (the same applies to any term-weighting scheme such as tf-idf, since we can define a_i to be the product of the tf vector and the term-weighting vector, e.g., idf), letting a_i be the tf vector of sentence i, the tf vector of the selected summary sentences is Φ(x) = ∑_{i=1}^{m} x_i a_i. Introducing A = [a_1, …, a_m], the (V × m) matrix with the a_i in its columns, allows the more succinct notation Φ(x) = Ax. Thus the first term can be written as ‖Ae − Ax‖², which is clearly convex quadratic in x.

Furthermore, while tf vectors contain raw word counts in the standard tf-based treatment, to address the scale mismatch between Ae and Ax (the former is usually larger than the latter due to the budget constraint), in practice it is often more effective to use normalized tf vectors. In essence, a tf vector is divided by the total number of words so that it represents relative frequencies instead (i.e., its entries sum to 1). To incorporate normalized features, we first define a_i as a normalized tf (or tf-idf) vector, and then define Φ(x) = (1/∑_{i=1}^{m} x_i) ∑_{i=1}^{m} x_i a_i. This makes the entries of Φ(x) sum to 1. Here the denominator can be replaced by b, taking the budget constraint to be tight (i.e., e^⊤x = b). There is no harm in this replacement since adding more sentences to a summary does not usually increase the feature-mismatch objective (cost). Similarly, the normalized tf vector of the entire document is simply the average of the a_i, namely the sum of the a_i divided by m. In summary, we replace the first term in the objective of (2) by ‖(1/m)Ae − (1/b)Ax‖², where A has the normalized sentence-wise features as columns. The budget constraint is also changed to the equality e^⊤x = b.
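
The normalization step can be sketched as follows, assuming raw per-sentence count vectors stacked as the columns of a matrix; the toy sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
V, m, b = 40, 10, 3
counts = rng.integers(0, 5, size=(V, m)).astype(float)   # raw tf counts, one column per sentence

# Normalize each sentence's tf vector so its entries sum to 1 (guard against empty sentences).
col_sums = np.maximum(counts.sum(axis=0, keepdims=True), 1.0)
A = counts / col_sums

e = np.ones(m)
doc_feat = A @ e / m          # (1/m) A e: normalized tf vector of the whole document
# For a (soft) selection x with e^T x = b, the summary feature is (1/b) A x and the
# mismatch term in the objective is ||(1/m) A e - (1/b) A x||^2.
```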

Now we discuss how to solve the optimization problem. Although the constraints are all linear, (2) is overall non-convex due to the second (entropy) term, which is concave in x. The objective has the form of a convex function plus a concave function, and this type of problem is often called convex-concave programming. We take advantage of the iterative linearization technique for convex-concave programming [8]. Defining f(x) = ‖(1/m)Ae − (1/b)Ax‖² and g(x) = λ ∑_{i=1}^{m} (x_i log(x_i) + (1−x_i) log(1−x_i)), our convex-concave optimization can be written as:

$$\min_{x}\ f(x) - g(x) \quad \text{s.t.}\ \ e^\top x = b,\ \ 0\le x\le e.\tag{3}$$

Note that f(x) and g(x) are both convex (but the objective is their difference, hence non-convex), and the constraints are linear. The non-convexity originates from the subtraction of g(x), and our optimization strategy is to iteratively approximate and solve the problem by linearizing g(x) around the previous iterate. We formalize this below.
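
Concretely, the convex-concave split can be expressed as two small functions. This is only an illustrative sketch, assuming numpy and a normalized feature matrix A as above.

```python
import numpy as np

def make_split(A, b, lam):
    """Return the convex functions f and g whose difference f(x) - g(x) is the objective in (3)."""
    V, m = A.shape
    doc_feat = A @ np.ones(m) / m                     # (1/m) A e

    def f(x):                                         # convex quadratic mismatch term
        return float(np.sum((doc_feat - A @ x / b) ** 2))

    def g(x):                                         # convex; note -g(x) is the concave entropy penalty
        x = np.clip(x, 1e-12, 1 - 1e-12)              # keep the logs finite at the box boundary
        return float(lam * np.sum(x * np.log(x) + (1 - x) * np.log(1 - x)))

    return f, g
```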

After iteration k, where we have the iterate x^(k), we approximate f(x) − g(x) by a convex function. Since f(x) is already convex, only the concave part −g(x) needs to be convexified. As the best convex approximation of a concave function is affine (linear), we define the convexified objective as:

$$h(x) = f(x) - g(x^{(k)}) - \nabla g(x^{(k)})^\top \big(x - x^{(k)}\big).\tag{4}$$

Here we use the first-order Taylor approximation of g(x) around x^(k), and the approximate objective h(x) is clearly convex. Furthermore, h(x) is a global upper bound of the original objective f(x) − g(x) due to the convexity of g(x). That is,

$$h(x) \ge f(x) - g(x), \quad \forall x.\tag{5}$$
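
As a quick numerical sanity check of the bound (5): since h differs from f − g only through the linearization of g, the gap h(x) − (f(x) − g(x)) equals g(x) − g(x^(k)) − ∇g(x^(k))^⊤(x − x^(k)), which is non-negative by the convexity of g. The short sketch below verifies this at random points; λ and the closed-form gradient [∇g(x)]_i = λ log(x_i/(1−x_i)) follow the paper, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 8, 0.01

def g(x):
    return lam * np.sum(x * np.log(x) + (1 - x) * np.log(1 - x))

def grad_g(x):
    return lam * np.log(x / (1 - x))

xk = rng.uniform(0.05, 0.95, m)                 # current iterate, kept away from {0, 1}
for _ in range(5):
    x = rng.uniform(0.05, 0.95, m)
    # h(x) - (f(x) - g(x)) = g(x) - g(xk) - grad_g(xk) @ (x - xk) >= 0 by convexity of g
    gap = g(x) - g(xk) - grad_g(xk) @ (x - xk)
    assert gap >= -1e-12
print("h(x) upper-bounds f(x) - g(x) at all sampled points")
```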

Now, we have the convexified approximate problem:

$$\min_{x}\ h(x) \quad \text{s.t.}\ \ e^\top x = b,\ \ 0\le x\le e,\tag{6}$$

which can be solved by any off-the-shelf convex optimization solver (e.g., the interior point method [9]). We denote the optimal solution of (6) by x^(k+1). A couple of points are worth noting. First, since x^(k) is the optimal solution of the approximate optimization at the previous iteration, x^(k) is feasible (i.e., it satisfies the constraints). Second, as h(x^(k)) = f(x^(k)) − g(x^(k)) by (4), h(x^(k)) cannot be smaller than h(x^(k+1)), the optimal value of (6). Combining this with (5), we have the following relations:

$$f(x^{(k)}) - g(x^{(k)}) = h(x^{(k)}) \ge h(x^{(k+1)}),\tag{7}$$
$$h(x^{(k+1)}) \ge f(x^{(k+1)}) - g(x^{(k+1)}).\tag{8}$$

Since both x^(k) and x^(k+1) are feasible, relations (7) and (8) imply that we improve the original objective value at every iteration. This guarantees that the iterations eventually converge to a local optimum of (3). We summarize the overall convex-concave solution algorithm in Algorithm 1, where we use [∇g(x)]_i = λ log(x_i / (1−x_i)) for i = 1, …, m. Once the optimization is done, the final summary of b sentences is constructed by collecting the sentences corresponding to the top-b values of x_i among i = 1, …, m.
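
Putting the pieces together, the following is a minimal end-to-end sketch of Algorithm 1 (stated formally at the end of the paper). It uses scipy's general-purpose SLSQP routine to solve the convex subproblem (6); the paper only requires some off-the-shelf convex solver (e.g., an interior point method), so this particular solver choice, the random toy features, the iteration cap, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def summarize(A, b, lam=0.01, n_iters=30, seed=0):
    """Convex-concave summarization sketch: returns indices of the b selected sentences.

    A is a (V, m) matrix of normalized per-sentence features (columns sum to 1)."""
    V, m = A.shape
    doc_feat = A @ np.ones(m) / m                         # (1/m) A e
    eps = 1e-9

    def f(x):                                             # convex quadratic part
        return np.sum((doc_feat - A @ x / b) ** 2)

    def grad_f(x):
        return -2.0 / b * (A.T @ (doc_feat - A @ x / b))

    def grad_g(x):                                        # [grad g(x)]_i = lam * log(x_i / (1 - x_i))
        x = np.clip(x, eps, 1 - eps)
        return lam * np.log(x / (1 - x))

    rng = np.random.default_rng(seed)
    x = np.clip(b / m + 0.01 * rng.standard_normal(m), eps, 1 - eps)  # near-feasible random start

    cons = [{"type": "eq", "fun": lambda z: np.sum(z) - b}]
    bounds = [(0.0, 1.0)] * m

    for _ in range(n_iters):
        c = grad_g(x)                                     # linearize g around the current iterate
        res = minimize(lambda z: f(z) - c @ z,            # surrogate h(x) up to an additive constant
                       x, jac=lambda z: grad_f(z) - c,
                       method="SLSQP", bounds=bounds, constraints=cons)
        if np.allclose(res.x, x, atol=1e-6):              # converged
            break
        x = res.x

    return np.argsort(-x)[:b]                             # sentences with the top-b selection scores

# Toy usage with random normalized tf features.
rng = np.random.default_rng(1)
counts = rng.integers(1, 5, size=(40, 12)).astype(float)
A = counts / counts.sum(axis=0, keepdims=True)
print(sorted(summarize(A, b=3).tolist()))
```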

Table 1. ROUGE-1 scores on the State Union and Culture datasets.

Methods     State Union data    Culture data
CVXCAV      0.5690              0.4966
First-b     0.1715              0.1580
Luhn58      0.4845              0.4534
Gaussian    0.4277              0.3793

4. Empirical Evaluations

We test the proposed document summarization algorithm on two text datasets with manually created human summaries. Brief descriptions of the datasets are given below.

  • State Union Dataset: This is a collection of about 200 documents from US presidents' speeches. The vocabulary size is around 23,000. We randomly select 20 documents for summarization and manually select the most descriptive sentences that best describe the key themes of each document.

  • Culture Dataset: This dataset is comprised of about 700 Internet articles on historic culture such as architecture, celebrities, paintings, and so on. Each document is about 2–3 KB in length, and the vocabulary size is around 22,000. For each of 40 randomly chosen documents, the key summary sentences are selected manually.

As a performance metric, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) criterion [10], the most popular measure in the document summarization literature. ROUGE essentially measures how many terms overlap between the retrieved summary and the human-selected one, and the ROUGE-1 measure used in this section is specifically defined as follows:

$$\text{ROUGE-1} = \frac{\sum_{t\in S} \min\big(\#(t,X),\, \#(t,S)\big)}{\sum_{t\in S} \#(t,S)},$$

where S is the reference summary (i.e., the set of sentences selected by humans), and X is the set retrieved by a summarization algorithm. Also, #(t, C) denotes the count of term t in the sentence set C. Note that the denominator is the total number of occurrences of all terms in the reference summary, and a larger ROUGE-1 score is better.
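
For reference, the ROUGE-1 formula above can be computed with a few lines of Python. This sketch assumes plain whitespace tokenization, whereas published results typically use the official ROUGE toolkit.

```python
from collections import Counter

def rouge1(candidate_sentences, reference_sentences):
    """ROUGE-1 = sum_t min(#(t, X), #(t, S)) / sum_t #(t, S), over terms t in the reference S."""
    x_counts = Counter(w for s in candidate_sentences for w in s.split())
    s_counts = Counter(w for s in reference_sentences for w in s.split())
    overlap = sum(min(x_counts[t], c) for t, c in s_counts.items())
    return overlap / sum(s_counts.values())

ref = ["the economy grew strongly", "education funding will increase"]
cand = ["the economy grew strongly", "the weather was pleasant"]
print(round(rouge1(cand, ref), 4))
```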

For the above two datasets, we compare the baseline approaches (First-b, Luhn58, and Gaussian) described in Section 2 with our convex-concave optimization approach (denoted by CVXCAV). The summary budget b is set to 3, and the balancing trade-off parameter is chosen empirically and fixed at λ = 0.01. The ROUGE-1 scores are reported in Table 1. Our convex-concave optimization approach performs best on both datasets, which can mainly be attributed to our principled formulation of the ultimate goal: faithfully representing the term distribution of the entire document within the given budget constraint. Moreover, the convex-concave approximate optimization method appears to be quite effective in finding viable relaxed solutions.

5. Conclusion

In this paper we have proposed a novel optimization formulation for the unsupervised document summarization task. The objective function addresses the word-distribution mismatch between the whole document and the summary, while the entropy-penalizing term encourages strong confidence in sentence selection. The proposed convex-concave optimization approach is not only theoretically guaranteed to converge, but also performs well in practice, as shown on real-world datasets. Extending the approach to the supervised learning setup and to abstractive summarization tasks remains future work.

Acknowledgements

This study was supported by Seoul National University of Science & Technology.

Algorithm 1. Document summarization via convex-concave programming.

Input: A = normalized sentence features (e.g., tf-idf) for a document; b = the number of sentences to be selected as a summary.
Output: b sentences as a summary.
Randomly choose x^(0) from the feasible set {x : e^⊤x = b, 0 ≤ x ≤ e} as an initial iterate.
Repeat for k = 0, 1, … until convergence:
 Solve the convex optimization:
$$x^{(k+1)} = \arg\min_{x}\ \left\|\frac{1}{m}Ae - \frac{1}{b}Ax\right\|^2 - \lambda\sum_{i=1}^{m}\log\frac{x_i^{(k)}}{1-x_i^{(k)}}\, x_i \quad \text{s.t.}\ \ e^\top x = b,\ \ 0\le x\le e.$$
Denote the final solution by x^opt.
Take the b sentences corresponding to the b largest x_i^opt among i = 1, …, m.


References

  1. Luhn, HP (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development. 2, 159-165.
  2. Lin, CY, and Hovy, E (2000). The automated acquisition of topic signatures for text summarization. Proceedings of the 18th Conference on Computational Linguistics, Saarbrucken, Germany, pp. 495-501.
  3. Kupiec, J, Pedersen, J, and Chen, F (1995). A trainable document summarizer. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, pp. 68-73.
  4. Conroy, JM, and O'Leary, DP (2001). Text summarization via hidden Markov models. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, pp. 406-407.
  5. Osborne, M (2002). Using maximum entropy for sentence extraction. Proceedings of the ACL-02 Workshop on Automatic Summarization, Philadelphia, PA, pp. 1-8.
  6. Svore, KM, Vanderwende, L, and Burges, CJC (2007). Enhancing single-document summarization by combining RankNet and third-party sources. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, pp. 448-457.
  7. Lin, H, and Bilmes, J (2011). A class of submodular functions for document summarization. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, pp. 510-520.
  8. Yuille, AL, and Rangarajan, A (2003). The concave-convex procedure. Neural Computation. 15, 915-936.
  9. Ye, Y (1997). Interior Point Algorithms: Theory and Analysis. New York: John Wiley & Sons.
  10. Lin, CY, and Hovy, E (2003). Automatic evaluation of summaries using N-gram co-occurrence statistics. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton, Canada, pp. 71-78.

Minyoung Kim received his BS and MS degrees both in Computer Science and Engineering from Seoul National University, South Korea. He earned a PhD degree in Computer Science from Rutgers University in 2008. From 2009 to 2010 he was a postdoctoral researcher at the Robotics Institute of Carnegie Mellon University. He is currently an associate professor in the Department of Electronics and IT Media Engineering at Seoul National University of Science and Technology in Korea. His primary research interests are machine learning and computer vision. His research focus includes graphical models, motion estimation/tracking, discriminative models/learning, kernel methods, and dimensionality reduction.

E-mail: mikim@seoultech.ac.kr
