Int. J. Fuzzy Log. Intell. Syst. 2016; 16(4): 293-298
Published online December 25, 2016
https://doi.org/10.5391/IJFIS.2016.16.4.293
© The Korean Institute of Intelligent Systems
Minyoung Kim
Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, Korea
Correspondence to :
Minyoung Kim, (mikim@seoultech.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Document summarization is an important task in various areas where the goal is to select a few the most descriptive sentences from a given document as a succinct summary. Even without training data of human labeled summaries, there has been several interesting existing work in the literature that yields reasonable performance. In this paper, within the same unsupervised learning setup, we propose a more principled learning framework for the document summarization task. Specifically we formulate an optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. We circumvent the difficult integer programming originating from binary sentence selection via continuous relaxation and the low entropy penalization. We also suggest an efficient convex-concave optimization solver algorithm that guarantees to improve the original objective at every iteration. For several document datasets, we demonstrate that the proposed learning algorithm significantly outperforms the existing approaches.
Keywords: Document summarization, Natural language processing, Text mining, Optimization
Document summarization is the task of automatic generation of a succinct summary or gist from a document or multiple related documents. While it has a relatively long (more than 50 years) history of research in natural language processing (NLP), text mining, and related fields, the document summarization has received unprecedented attention recently due to the enormous amount of text, news, blogs, or web pages data that need to be processed efficiently. For instance, in search engines today like Google, the top-ranked web pages relevant to a user query are shown with their titles, links, and the informative summaries (called snippets) of web pages in the most descriptive 20 to 30 words. That is, more accurate and efficient document summarization algorithms are highly demanding.
As there are several variants of document summarization, here we clarify the exact problem definition that we are going to deal with in this paper. First, there are two different tasks of summarization depending on the output forms: the
Moreover, in the learning/optimization point of view, the summarization task can be categorized into either supervised or unsupervised. It is based on whether one is given an offline (training) data of pairs of summaries and documents. Despite the fact that collecting human supervised data (manual summaries) is very expansive, in current research it is often yet difficult to exploit the supervised data effectively. Several proposed supervised summarization algorithms often underperform unsupervised methods that do not require manually supervised summary data. For the very reason that collecting human summary supervision is expensive, in this paper we deal with the unsupervised learning setup.
Perhaps, the pioneering research in (unsupervised) document summarization is the work by Luhn [1], in which he introduced the so-called
More sophisticated summarization approaches have been proposed later on. Enumerating all the related work is impossible, and here we list a few important previous approaches. The probabilistic approaches in general aim to represent the statistical process of generating words in a sentence where the distributions are differentiated depending on the importance of a sentence. While the Naive Bayes model [3] assumes word-wise independency for computational simplicity, the restriction is relaxed to yield more flexible models by incorporating sequential dependency (i.e., word ordering in sentences): the hidden Markov models [4] or conditional log-linear models [5]. The discriminative approaches like neural network training [6] often results in more accurate summarization due to the rich representational capacity in possibly deep network architectures. However, these approaches are mostly based on supervised learning, suffering from the cost of collecting lots of labeled data (i.e., summaries manually given by humans). Even more recently, the sub-modular optimization techniques have been proposed [7] that greedily finds the most descriptive sentences.
In this paper we propose a fairly simple but very efficient algorithm for document summarization. We formulate a reasonable optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. The difficulty of integer programming originating from binary selection variables, is circumvented by real value relaxation and the low entropy penalization. The relaxed/approximated problem becomes an instance of a tractable convex-concave optimization, and we provide an efficient solution method that iteratively upper-bounds and solves the original problem. Tested on some real-world news document data, the proposed approach is shown to produce more plausible summaries than existing/baseline methods.
The rest of the paper is organized as follows. After introducing formal notations and discussing some baseline document summarization methods in Section 2, our main approach of convex-concave formulation of the problem is described in Section 3. The empirical results of the proposed method are provided in Section 4, followed by concluding remarks.
In this section we introduce some notations that are used throughout the paper, and describe several baseline unsupervised approaches that can yield reasonable document summarization results.
A document is comprised of words from the vocabulary set of size
While the tf representation captures salient features about a document/sentence, it cares about the frequencies of words, treating every word equally important. This is counter-intuitive in that certain stop words (e.g., articles
Now we discuss several reasonable baseline approaches for document summarization. Although these approaches look simple, they often quite successful in producing plausible summaries of a document. In the extractive summary task, the goal is to select (at most)
The second approach is the significance factor method introduced in [1]. Specifically, we evaluate the tf-idf vector for the whole document to assign the importance score to each word. Then we evaluate the significance score for each sentence as the sum of the importance scores of the constituting words. Then we select
For the third baseline, one can come up with a fairly reasonable probabilistic (e.g., Gaussian) density model for representing the sentence generation process. More formally, for a document comprised of
In this section we propose our document summarization formulation. For a document comprised of
In (
However, the problem of (
where the inequalities 0 ≤
To be more concrete, notice that the first term in the objective is convex quadratic in
Furthermore, while tf vectors contain word counts in a standard tf-based treatment, for the issue of scale matching between
Now, we discuss how to solve the optimization problem. Although the constraints are all linear, (
Note that
After iteration
Here we used the first-order Taylor approximation for
Now, we have the convexified approximate problem:
which can be solved by any off-the-shelf convex optimization solver (e.g., the interior point method [9]). We denote the optimal solution of (
Since both
Table 1. ROUGE-1 scores on the state union dataset
Methods | State union data | Culture data |
---|---|---|
0.5690 | 0.4966 | |
0.1715 | 0.1580 | |
0.4845 | 0.4534 | |
0.4277 | 0.3793 |
We test the proposed document summarization algorithm on two text datasets with human summarization manually given. The brief descriptions of the datasets are summarized below.
As a performance metric, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) criterion [10] that is the most popular measure in the document summary literature. The ROUGE basically measures how many terms are overlapped between the retrieved summary and the human selected, and the ROUGE-1 measure we are going to use in this section is specifically defined as follows.
where
For the above two datasets, we compare the baseline approaches (
In this paper we have proposed a novel optimization problem formulation for the unsupervised document summarization tasks. The objective function deals with the word distribution mismatch between the whole document and the summary, while the entropy penalizing term encourages the strong confidence in sentence selection. The proposed convex-concave optimization approach is not only guaranteed to converge theoretically, but also performs well in practice as shown on several real-world datasets. Extensions of the approach to the supervised learning setup and the abstractive summarization tasks remain as future work.
This study was supported by Seoul National University of Science & Technology.
No potential conflict of interest relevant to this article was reported.
E-mail: mikim@seoultech.ac.kr
Int. J. Fuzzy Log. Intell. Syst. 2016; 16(4): 293-298
Published online December 25, 2016 https://doi.org/10.5391/IJFIS.2016.16.4.293
Copyright © The Korean Institute of Intelligent Systems.
Minyoung Kim
Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, Korea
Correspondence to:Minyoung Kim, (mikim@seoultech.ac.kr)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Document summarization is an important task in various areas where the goal is to select a few the most descriptive sentences from a given document as a succinct summary. Even without training data of human labeled summaries, there has been several interesting existing work in the literature that yields reasonable performance. In this paper, within the same unsupervised learning setup, we propose a more principled learning framework for the document summarization task. Specifically we formulate an optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. We circumvent the difficult integer programming originating from binary sentence selection via continuous relaxation and the low entropy penalization. We also suggest an efficient convex-concave optimization solver algorithm that guarantees to improve the original objective at every iteration. For several document datasets, we demonstrate that the proposed learning algorithm significantly outperforms the existing approaches.
Keywords: Document summarization, Natural language processing, Text mining, Optimization
Document summarization is the task of automatic generation of a succinct summary or gist from a document or multiple related documents. While it has a relatively long (more than 50 years) history of research in natural language processing (NLP), text mining, and related fields, the document summarization has received unprecedented attention recently due to the enormous amount of text, news, blogs, or web pages data that need to be processed efficiently. For instance, in search engines today like Google, the top-ranked web pages relevant to a user query are shown with their titles, links, and the informative summaries (called snippets) of web pages in the most descriptive 20 to 30 words. That is, more accurate and efficient document summarization algorithms are highly demanding.
As there are several variants of document summarization, here we clarify the exact problem definition that we are going to deal with in this paper. First, there are two different tasks of summarization depending on the output forms: the
Moreover, in the learning/optimization point of view, the summarization task can be categorized into either supervised or unsupervised. It is based on whether one is given an offline (training) data of pairs of summaries and documents. Despite the fact that collecting human supervised data (manual summaries) is very expansive, in current research it is often yet difficult to exploit the supervised data effectively. Several proposed supervised summarization algorithms often underperform unsupervised methods that do not require manually supervised summary data. For the very reason that collecting human summary supervision is expensive, in this paper we deal with the unsupervised learning setup.
Perhaps, the pioneering research in (unsupervised) document summarization is the work by Luhn [1], in which he introduced the so-called
More sophisticated summarization approaches have been proposed later on. Enumerating all the related work is impossible, and here we list a few important previous approaches. The probabilistic approaches in general aim to represent the statistical process of generating words in a sentence where the distributions are differentiated depending on the importance of a sentence. While the Naive Bayes model [3] assumes word-wise independency for computational simplicity, the restriction is relaxed to yield more flexible models by incorporating sequential dependency (i.e., word ordering in sentences): the hidden Markov models [4] or conditional log-linear models [5]. The discriminative approaches like neural network training [6] often results in more accurate summarization due to the rich representational capacity in possibly deep network architectures. However, these approaches are mostly based on supervised learning, suffering from the cost of collecting lots of labeled data (i.e., summaries manually given by humans). Even more recently, the sub-modular optimization techniques have been proposed [7] that greedily finds the most descriptive sentences.
In this paper we propose a fairly simple but very efficient algorithm for document summarization. We formulate a reasonable optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. The difficulty of integer programming originating from binary selection variables, is circumvented by real value relaxation and the low entropy penalization. The relaxed/approximated problem becomes an instance of a tractable convex-concave optimization, and we provide an efficient solution method that iteratively upper-bounds and solves the original problem. Tested on some real-world news document data, the proposed approach is shown to produce more plausible summaries than existing/baseline methods.
The rest of the paper is organized as follows. After introducing formal notations and discussing some baseline document summarization methods in Section 2, our main approach of convex-concave formulation of the problem is described in Section 3. The empirical results of the proposed method are provided in Section 4, followed by concluding remarks.
In this section we introduce some notations that are used throughout the paper, and describe several baseline unsupervised approaches that can yield reasonable document summarization results.
A document is comprised of words from the vocabulary set of size
While the tf representation captures salient features about a document/sentence, it cares about the frequencies of words, treating every word equally important. This is counter-intuitive in that certain stop words (e.g., articles
Now we discuss several reasonable baseline approaches for document summarization. Although these approaches look simple, they often quite successful in producing plausible summaries of a document. In the extractive summary task, the goal is to select (at most)
The second approach is the significance factor method introduced in [1]. Specifically, we evaluate the tf-idf vector for the whole document to assign the importance score to each word. Then we evaluate the significance score for each sentence as the sum of the importance scores of the constituting words. Then we select
For the third baseline, one can come up with a fairly reasonable probabilistic (e.g., Gaussian) density model for representing the sentence generation process. More formally, for a document comprised of
In this section we propose our document summarization formulation. For a document comprised of
In (
However, the problem of (
where the inequalities 0 ≤
To be more concrete, notice that the first term in the objective is convex quadratic in
Furthermore, while tf vectors contain word counts in a standard tf-based treatment, for the issue of scale matching between
Now, we discuss how to solve the optimization problem. Although the constraints are all linear, (
Note that
After iteration
Here we used the first-order Taylor approximation for
Now, we have the convexified approximate problem:
which can be solved by any off-the-shelf convex optimization solver (e.g., the interior point method [9]). We denote the optimal solution of (
Since both
Table 1. ROUGE-1 scores on the state union dataset.
Methods | State union data | Culture data |
---|---|---|
0.5690 | 0.4966 | |
0.1715 | 0.1580 | |
0.4845 | 0.4534 | |
0.4277 | 0.3793 |
We test the proposed document summarization algorithm on two text datasets with human summarization manually given. The brief descriptions of the datasets are summarized below.
As a performance metric, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) criterion [10] that is the most popular measure in the document summary literature. The ROUGE basically measures how many terms are overlapped between the retrieved summary and the human selected, and the ROUGE-1 measure we are going to use in this section is specifically defined as follows.
where
For the above two datasets, we compare the baseline approaches (
In this paper we have proposed a novel optimization problem formulation for the unsupervised document summarization tasks. The objective function deals with the word distribution mismatch between the whole document and the summary, while the entropy penalizing term encourages the strong confidence in sentence selection. The proposed convex-concave optimization approach is not only guaranteed to converge theoretically, but also performs well in practice as shown on several real-world datasets. Extensions of the approach to the supervised learning setup and the abstractive summarization tasks remain as future work.
This study was supported by Seoul National University of Science & Technology.
Algorithm 1. Document summarization via convex-concave programming.
Randomly choose |
Repeat for |
Solve the convex optimization: |
The final solution denoted by |
Take |
Table 1. ROUGE-1 scores on the state union dataset.
Methods | State union data | Culture data |
---|---|---|
0.5690 | 0.4966 | |
0.1715 | 0.1580 | |
0.4845 | 0.4534 | |
0.4277 | 0.3793 |
Seoung-Ho Choi
International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(4): 317-332 https://doi.org/10.5391/IJFIS.2023.24.4.317Le Van Hoa and Vo Viet Minh Nhat
International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(3): 181-193 https://doi.org/10.5391/IJFIS.2024.24.3.181Ho-Seung Kim and Jee-Hyong Lee
International Journal of Fuzzy Logic and Intelligent Systems 2024; 24(2): 83-92 https://doi.org/10.5391/IJFIS.2024.24.2.83