Long Document Summarization in a Low Resource Setting using Pretrained
Language Models
- URL: http://arxiv.org/abs/2103.00751v1
- Date: Mon, 1 Mar 2021 04:43:55 GMT
- Title: Long Document Summarization in a Low Resource Setting using Pretrained
Language Models
- Authors: Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok
Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer,
Rajarshi Das and Andrew McCallum
- Abstract summary: We study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words.
We use the modern pretrained abstractive summarizer BART, which achieves only 17.9 ROUGE-L as it struggles with long documents.
On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abstractive summarization is the task of compressing a long document into a
coherent short document while retaining salient information. Modern abstractive
summarization methods are based on deep neural networks which often require
large training datasets. Since collecting summarization datasets is an
expensive and time-consuming task, practical industrial settings are usually
low-resource. In this paper, we study a challenging low-resource setting of
summarizing long legal briefs with an average source document length of 4268
words and only 120 available (document, summary) pairs. To account for data
scarcity, we use a modern pretrained abstractive summarizer, BART (Lewis et
al., 2020), which achieves only 17.9 ROUGE-L as it struggles with long
documents. We thus compress these long documents by identifying the salient
sentences in the source that best ground the summary, using a novel algorithm
based on GPT-2 (Radford et al., 2019) language model perplexity scores that
operates within the low-resource regime. On feeding the compressed
documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats
several competitive salience detection baselines. Furthermore, the identified
salient sentences tend to agree with an independent human labeling by domain
experts.
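The abstract outlines a two-step pipeline: score source sentences for salience with GPT-2 perplexity, keep the most salient subset, and feed the compressed document to BART. A minimal sketch of that idea follows; since the abstract does not spell out the algorithm, the lowest-perplexity selection heuristic, the checkpoints, the naive sentence splitter, and the keep_ratio below are illustrative assumptions, not the authors' method.

```python
# A minimal sketch: rank source sentences with GPT-2 perplexity, keep a
# salient subset, then summarize the compressed document with BART.
# The scoring rule and all hyperparameters here are assumptions.
import torch
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart_tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

@torch.no_grad()
def perplexity(sentence: str) -> float:
    """GPT-2 perplexity of one sentence (lower = more fluent/typical)."""
    ids = gpt2_tok(sentence, return_tensors="pt").input_ids
    loss = gpt2(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

def compress(document: str, keep_ratio: float = 0.3) -> str:
    """Keep the keep_ratio lowest-perplexity sentences, in document order."""
    sents = [s.strip() for s in document.split(". ") if s.strip()]  # naive split
    ranked = sorted(range(len(sents)), key=lambda i: perplexity(sents[i]))
    keep = set(ranked[: max(1, int(keep_ratio * len(sents)))])
    return ". ".join(s for i, s in enumerate(sents) if i in keep) + "."

def summarize(document: str) -> str:
    """Summarize the compressed document with BART."""
    inputs = bart_tok(compress(document), return_tensors="pt",
                      truncation=True, max_length=1024)
    out = bart.generate(**inputs, num_beams=4, max_length=256)
    return bart_tok.decode(out[0], skip_special_tokens=True)
```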
Related papers
- On Positional Bias of Faithfulness for Long-form Summarization [83.63283027830657]
Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs.
We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias.
arXiv Detail & Related papers (2024-10-31T03:50:15Z)
- A Novel LLM-based Two-stage Summarization Approach for Long Dialogues [9.835499880812646]
This study proposes a hierarchical framework that segments and condenses information from long documents.
The condensation stage utilizes an unsupervised generation model to generate condensed data.
The summarization stage fine-tunes the abstractive summarization model on the condensed data to generate the final results.
arXiv Detail & Related papers (2024-10-09T03:42:40Z)
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study [1.933681537640272]
ChatGPT is the latest breakthrough in the field of large language models (LLMs).
We propose a hybrid extraction and summarization pipeline for long documents such as business articles and books.
Our results show that the use of ChatGPT is a very promising but not yet mature approach for summarizing long documents.
arXiv Detail & Related papers (2023-06-01T21:58:33Z)
- TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation [22.738731393540633]
In domains where the source text is relatively long-form, such as scientific documents, such a summary cannot go beyond a general and coarse overview.
In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information.
arXiv Detail & Related papers (2022-06-02T02:45:31Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of the benchmark datasets for this task are from the scientific domain, containing only the document title and abstract information.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents [13.755637074366813]
Summ^N is a simple, flexible, and effective multi-stage framework for input texts longer than the maximum context lengths of typical pretrained LMs.
It can process input text of arbitrary length by adjusting the number of stages while keeping the LM context size fixed (a toy sketch of this staging idea appears after this list).
Our experiments demonstrate that Summ^N significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-10-16T06:19:54Z)
- On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers.
Our method exploits the hierarchical structure of documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can shift the extraction probability distribution in favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)
- SummPip: Unsupervised Multi-Document Summarization with Sentence Graph Compression [61.97200991151141]
SummPip is an unsupervised method for multi-document summarization.
We convert the original documents to a sentence graph, taking both linguistic and deep representations into account.
We then apply spectral clustering to obtain multiple clusters of sentences, and finally compress each cluster to generate the final summary (a rough sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2020-07-17T13:01:15Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches for real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
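As referenced in the Summ^N entry above, the core mechanism is staging: split an over-long input into chunks that fit the model's context window, summarize each chunk, join the partial summaries, and repeat until one window suffices. A toy sketch follows, assuming a BART backbone, a crude word-based chunker, and arbitrary token budgets; none of these are the paper's configuration.

```python
# A toy sketch of the Summ^N idea: while the input exceeds the context
# budget, chunk it, summarize each chunk, and join the partial summaries.
# Backbone, chunker, and budgets are illustrative assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize_chunk(text: str, max_new: int = 150) -> str:
    """One summarization call, truncating anything past the context window."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model.generate(**inputs, num_beams=4, max_length=max_new)
    return tok.decode(out[0], skip_special_tokens=True)

def summ_n_style(text: str, budget: int = 900) -> str:
    """Add coarse stages until the text fits one context window, then finish."""
    while len(tok.tokenize(text)) > budget:
        words = text.split()
        chunks = [" ".join(words[i:i + budget])
                  for i in range(0, len(words), budget)]
        text = " ".join(summarize_chunk(c) for c in chunks)  # one coarse stage
    return summarize_chunk(text)  # final fine-grained stage
```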
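The SummPip entry above describes a graph-then-cluster pipeline. The rough sketch below swaps in simplifications: TF-IDF cosine similarity stands in for the paper's combined linguistic and deep representations, and picking the most central sentence per cluster stands in for its word-graph compression.

```python
# A rough sketch of a SummPip-style pipeline: build a sentence-similarity
# graph, cluster it spectrally, and emit one sentence per cluster.
# The affinity and per-cluster selection rule are simplifying assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summpip_style(sentences: list[str], n_clusters: int = 5) -> str:
    n_clusters = min(n_clusters, len(sentences))
    tfidf = TfidfVectorizer().fit_transform(sentences)
    affinity = cosine_similarity(tfidf)  # dense, non-negative sentence graph
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(affinity)
    reps = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # representative = sentence most similar to the rest of its cluster
        reps.append(idx[affinity[np.ix_(idx, idx)].mean(axis=1).argmax()])
    return " ".join(sentences[i] for i in sorted(reps))  # document order
```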
This list is automatically generated from the titles and abstracts of the papers on this site.