Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
- URL: http://arxiv.org/abs/2502.06617v1
- Date: Mon, 10 Feb 2025 16:15:08 GMT
- Title: Scaling Multi-Document Event Summarization: Evaluating Compression vs. Full-Text Approaches
- Authors: Adithya Pratapa, Teruko Mitamura
- Abstract summary: We contrast two classes of systems for large-scale multi-document summarization (MDS): compression and full-text.
Full-text methods promise a lossless summary by relying on recent advances in long-context reasoning.
We show that compression-based methods hold strong promise at intermediate stages, even outperforming full-context approaches.
- Abstract: Automatically summarizing large text collections is a valuable tool for document research, with applications in journalism, academic research, legal work, and many other fields. In this work, we contrast two classes of systems for large-scale multi-document summarization (MDS): compression and full-text. Compression-based methods use a multi-stage pipeline and often lead to lossy summaries. Full-text methods promise a lossless summary by relying on recent advances in long-context reasoning. To understand their utility on large-scale MDS, we evaluated them on three datasets, each containing approximately one hundred documents per summary. Our experiments cover a diverse set of long-context transformers (Llama-3.1, Command-R, Jamba-1.5-Mini) and compression methods (retrieval-augmented, hierarchical, incremental). Overall, we find that full-text and retrieval methods perform best in most settings. With further analysis into the salient information retention patterns, we show that compression-based methods hold strong promise at intermediate stages, even outperforming full-context approaches. However, they suffer information loss due to their multi-stage pipeline and lack of global context. Our results highlight the need to develop hybrid approaches that combine compression and full-text methods for optimal performance on large-scale multi-document summarization.
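To make the contrast concrete, here is a minimal, self-contained sketch (not the paper's code) of the two system classes: a full-text approach that feeds every document into one long-context pass, and a retrieval-based compression approach that keeps only the most query-relevant documents first. The `summarize` function is a hypothetical stand-in for an LLM call; here it simply truncates so the example runs on its own.

```python
def summarize(text: str, budget: int = 120) -> str:
    """Placeholder for an LLM summarizer: keep the first `budget` characters."""
    return text[:budget]

def full_text_summary(docs: list[str]) -> str:
    """Full-text: concatenate everything and summarize in one long-context pass."""
    return summarize("\n".join(docs))

def retrieval_summary(docs: list[str], query: str, k: int = 2) -> str:
    """Compression via retrieval: keep only the k docs most relevant to the
    query (scored here by naive word overlap), then summarize the reduced context."""
    def score(doc: str) -> int:
        return len(set(doc.lower().split()) & set(query.lower().split()))
    top_k = sorted(docs, key=score, reverse=True)[:k]
    return summarize("\n".join(top_k))

docs = [
    "The earthquake struck the coastal region early on Monday.",
    "Rescue teams arrived within hours to search collapsed buildings.",
    "A local festival scheduled for the weekend was postponed.",
]
print(retrieval_summary(docs, query="earthquake rescue teams"))
```

The trade-off the paper studies shows up even here: retrieval shrinks the context before summarization (cheaper, but the dropped festival document is lost for good), while the full-text variant sees everything at the cost of a much longer input.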
Related papers
- Context-Aware Hierarchical Merging for Long Document Summarization [56.96619074316232]
We propose different approaches to enrich hierarchical merging with context from the source document.
Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines.
arXiv Detail & Related papers (2025-02-03T01:14:31Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge.
This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning.
Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation [61.53695868960846]
We propose compressing retrieved documents into textual summaries prior to in-context integration.
This not only reduces the computational costs but also relieves the burden of LMs to identify relevant information in long retrieved documents.
We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
arXiv Detail & Related papers (2023-10-06T17:55:36Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents [13.755637074366813]
SummN is a simple, flexible, and effective multi-stage framework for input texts longer than the maximum context lengths of typical pretrained LMs.
It can process input text of arbitrary length by adjusting the number of stages while keeping the LM context size fixed.
Our experiments demonstrate that SummN significantly outperforms previous state-of-the-art methods.
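The "arbitrary length with a fixed context size" property can be illustrated with a short sketch (an illustration of the multi-stage idea, not the authors' code): if each stage compresses the text by an assumed per-chunk ratio, the number of stages needed grows only logarithmically with input length.

```python
import math

def stages_needed(input_len: int, window: int, ratio: float = 0.25) -> int:
    """Number of summarize-and-concatenate stages until the text fits in `window`.
    `ratio` is the assumed output/input length ratio of each stage's summaries;
    both values are illustrative, not taken from the paper."""
    stages = 0
    length = input_len
    while length > window:
        length = math.ceil(length * ratio)  # each stage compresses all chunks
        stages += 1
    return stages + 1  # final stage produces the summary within the window

print(stages_needed(100_000, window=4_096))
```

Under these assumed numbers, a 100k-token input needs only a handful of stages with a 4k-token window, which is why the stage count, rather than the context size, absorbs growth in input length.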
arXiv Detail & Related papers (2021-10-16T06:19:54Z)
- PoBRL: Optimizing Multi-Document Summarization by Blending Reinforcement Learning Policies [68.8204255655161]
We propose a reinforcement learning based framework PoBRL for solving multi-document summarization.
Our strategy decouples this multi-objective optimization into different subproblems that can be solved individually by reinforcement learning.
Our empirical analysis shows state-of-the-art performance on several multi-document datasets.
arXiv Detail & Related papers (2021-05-18T02:55:42Z)
- On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers.
Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)
- SummPip: Unsupervised Multi-Document Summarization with Sentence Graph Compression [61.97200991151141]
SummPip is an unsupervised method for multi-document summarization.
We convert the original documents to a sentence graph, taking both linguistic and deep representation into account.
We then apply spectral clustering to obtain multiple clusters of sentences, and finally compress each cluster to generate the final summary.
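A rough sketch of this pipeline shape, heavily simplified and not the authors' implementation: SummPip uses linguistic and deep representations with spectral clustering, whereas for brevity this stand-in scores sentence similarity by word overlap, groups greedily, and "compresses" each cluster by keeping its shortest member.

```python
def overlap(a: str, b: str) -> float:
    """Naive similarity: shared words over the smaller sentence's vocabulary."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, min(len(wa), len(wb)))

def cluster_sentences(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Greedy stand-in for spectral clustering over the sentence graph:
    attach a sentence to the first cluster whose seed it overlaps enough."""
    clusters: list[list[str]] = []
    for sent in sentences:
        for cluster in clusters:
            if overlap(sent, cluster[0]) >= threshold:
                cluster.append(sent)
                break
        else:
            clusters.append([sent])
    return clusters

def summarize(sentences: list[str]) -> str:
    # Compress each cluster to one sentence by keeping its shortest member.
    return " ".join(min(c, key=len) for c in cluster_sentences(sentences))

sents = [
    "The storm caused severe flooding in the city.",
    "Severe flooding hit the city after the storm.",
    "Schools remained closed on Tuesday.",
]
print(summarize(sents))
```

The two near-duplicate flood sentences land in one cluster and are compressed to a single sentence, while the unrelated school closure survives as its own cluster, mirroring how cluster-then-compress removes redundancy across documents.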
arXiv Detail & Related papers (2020-07-17T13:01:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.