Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries
- URL: http://arxiv.org/abs/2505.21859v1
- Date: Wed, 28 May 2025 01:12:50 GMT
- Title: Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries
- Authors: Vishakh Padmakumar, Zichao Wang, David Arbour, Jennifer Healey
- Abstract summary: Large language models exhibit the "lost in the middle" phenomenon. This hinders their ability to cover diverse source material in multi-document summarization. We show that principled content selection is a simple way to increase source coverage on this task.
- Score: 23.46979218958048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the "lost in the middle" phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps -- (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPPs) to select key points that prioritize diverse content, and (3) rewriting the selected points into the final summary. By combining prompting steps (for extraction and rewriting) with principled techniques (for content selection), we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.
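The DPP selection step admits a compact illustration. Below is a minimal sketch of greedy MAP inference for a DPP over key-point embeddings; the function name, the cosine-similarity kernel, and the way relevance enters are assumptions for illustration, not the authors' implementation. Setting `relevance` from similarity to a user intent corresponds to the personalization variant the abstract describes.

```python
import numpy as np

def greedy_dpp_select(embeddings, k, relevance=None):
    """Greedy MAP inference for a DPP over key-point embeddings.

    A hedged sketch: the kernel L = diag(q) @ S @ diag(q) combines
    per-point relevance q with pairwise cosine similarity S, so the
    determinant of a selected submatrix rewards sets of points that
    are relevant AND mutually dissimilar.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                                    # cosine similarity (PSD Gram matrix)
    q = np.ones(len(X)) if relevance is None else np.asarray(relevance)
    L = q[:, None] * S * q[None, :]                # quality-weighted DPP kernel

    selected = []
    for _ in range(k):
        best_i, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:  # skip near-singular picks
                best_i, best_logdet = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected
```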
Related papers
- A Unifying Scheme for Extractive Content Selection Tasks [18.59681132630319]
In this work, we propose instruction-guided content selection (IGCS) as a beneficial unified framework for such settings. To promote this framework, we introduce IGCSBench, the first unified benchmark covering diverse content selection tasks. We also create a large generic synthetic dataset that can be leveraged for diverse content selection tasks.
arXiv Detail & Related papers (2025-07-22T18:02:54Z) - PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization [61.783280234747394]
PRISM is a document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. We present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Experiments show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
arXiv Detail & Related papers (2025-07-14T08:41:53Z) - LAQuer: Localized Attribution Queries in Content-grounded Generation [69.60308443863606]
Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. We introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution.
arXiv Detail & Related papers (2025-06-01T21:46:23Z) - Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts [67.67746334493302]
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. We propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP). We show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies.
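As a rough illustration of sequential, state-dependent retrieval (not the paper's tri-encoder or its reinforcement-learning training), here is a greedy sketch in which each step scores the remaining candidates against the query plus the documents already retrieved; all encoder names are hypothetical.

```python
import numpy as np

def sequential_retrieve(query, candidates, encode_query, encode_state, encode_doc, k):
    # Hypothetical encoders: encode_query/encode_state/encode_doc each
    # return a vector. The state combines the query with the documents
    # retrieved so far, so each pick depends on the previous picks --
    # the step-by-step structure the paper formalizes as an MDP.
    selected = []
    for _ in range(k):
        state = encode_query(query) + encode_state(selected)
        remaining = [d for d in candidates if d not in selected]
        if not remaining:
            break
        selected.append(max(remaining, key=lambda d: float(np.dot(state, encode_doc(d)))))
    return selected
```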
arXiv Detail & Related papers (2025-04-15T17:35:56Z) - The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed into the subtasks of salience and redundancy detection.
Alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z) - A Modular Approach for Multimodal Summarization of TV Shows [55.20132267309382]
We present a modular approach where separate components perform specialized sub-tasks.
Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode.
We also present a new metric, PRISMA, that measures both precision and recall of generated summaries by decomposing them into atomic facts.
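For illustration, a hedged sketch of how such an atomic-fact precision/recall might be computed; `is_supported` stands in for whatever fact-matching model the metric actually uses, and all names here are hypothetical.

```python
def fact_precision_recall(generated_facts, reference_facts, is_supported):
    # is_supported(fact, fact_set) -> bool is an assumed black-box
    # matcher (e.g., an NLI entailment check); PRISMA's actual
    # matching procedure may differ.
    tp_gen = sum(is_supported(g, reference_facts) for g in generated_facts)
    tp_ref = sum(is_supported(r, generated_facts) for r in reference_facts)
    precision = tp_gen / len(generated_facts) if generated_facts else 0.0
    recall = tp_ref / len(reference_facts) if reference_facts else 0.0
    return precision, recall
```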
arXiv Detail & Related papers (2024-03-06T16:10:01Z) - LLM Based Multi-Document Summarization Exploiting Main-Event Biased Monotone Submodular Content Extraction [42.171703872560286]
Multi-document summarization is a challenging task due to its inherent subjective bias.
We aim to enhance the objectivity of news summarization by focusing on the main event of a group of related news documents.
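The monotone submodular content extraction named in the title admits a standard greedy sketch: for a monotone submodular objective under a cardinality budget, greedy selection carries the classic (1 - 1/e) approximation guarantee (Nemhauser et al., 1978). The objective `f` below is a placeholder; the paper's main-event-biased objective is not reproduced here.

```python
def greedy_submodular(candidates, f, budget):
    # Greedy maximization of a monotone submodular set function f
    # under a cardinality budget; f is a stand-in for the paper's
    # main-event-biased coverage objective.
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        base = f(selected)
        best = max(remaining, key=lambda c: f(selected + [c]) - base)
        if f(selected + [best]) - base <= 0:
            break  # no candidate adds value; stop early
        selected.append(best)
        remaining.remove(best)
    return selected
```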
arXiv Detail & Related papers (2023-10-05T09:38:09Z) - Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z) - Absformer: Transformer-based Model for Unsupervised Multi-Document Abstractive Summarization [1.066048003460524]
Multi-document summarization (MDS) is the task of condensing the text of multiple documents into a single concise summary.
Abstractive MDS aims to generate a coherent and fluent summary for multiple documents using natural language generation techniques.
We propose Absformer, a new Transformer-based method for unsupervised abstractive summary generation.
arXiv Detail & Related papers (2023-06-07T21:18:23Z) - Leveraging Information Bottleneck for Scientific Document Summarization [26.214930773343887]
This paper presents an unsupervised extractive approach to summarize scientific long documents.
Inspired by previous work that uses the Information Bottleneck principle for sentence compression, we extend it to document-level summarization.
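For reference, the underlying Information Bottleneck objective (Tishby et al., 1999) compresses the source X into a representation T while retaining information about a relevance variable Y, traded off by β; how the paper instantiates X, T, and Y for long documents is not reproduced here.

```latex
\min_{p(t \mid x)} \; I(X;\, T) \;-\; \beta \, I(T;\, Y)
```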
arXiv Detail & Related papers (2021-10-04T09:43:47Z) - A New Approach to Overgenerating and Scoring Abstractive Summaries [9.060597430218378]
We propose a two-stage strategy to generate a diverse set of candidate summaries from the source text in stage one, then score and select admissible ones in stage two.
Our generator gives a precise control over the length of the summary, which is especially well-suited when space is limited.
Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text.
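A minimal sketch of this two-stage recipe, with `generate` and `score` as assumed black boxes standing in for the paper's length-controlled generator and faithfulness-oriented selectors:

```python
def overgenerate_and_score(source, generate, score, target_lengths):
    # Stage 1: overgenerate candidates at several target lengths
    # (the paper's generator controls summary length precisely).
    candidates = [generate(source, max_len=n) for n in target_lengths]
    # Stage 2: score candidates (e.g., for faithfulness) and keep the best.
    return max(candidates, key=lambda summary: score(source, summary))
```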
arXiv Detail & Related papers (2021-04-05T00:29:45Z) - SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the maximum mean discrepancy (MMD) from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
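For reference, the squared MMD from kernel two-sample testing that the method builds on (Gretton et al., 2012), stated for distributions P and Q in an RKHS with kernel k; the paper's supervised extension is not shown:

```latex
\mathrm{MMD}^2(P, Q) \;=\;
  \mathbb{E}_{x, x' \sim P}\!\left[k(x, x')\right]
  \;-\; 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\!\left[k(x, y)\right]
  \;+\; \mathbb{E}_{y, y' \sim Q}\!\left[k(y, y')\right]
```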
arXiv Detail & Related papers (2020-10-06T09:26:55Z)