Faithful Summarisation under Disagreement via Belief-Level Aggregation
- URL: http://arxiv.org/abs/2601.04889v1
- Date: Thu, 08 Jan 2026 12:40:47 GMT
- Title: Faithful Summarisation under Disagreement via Belief-Level Aggregation
- Authors: Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi
- Abstract summary: We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models.
- Score: 10.334277776439423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
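The abstract's core idea, merging structured belief sets with distance-based operators before any text is generated, can be illustrated with a minimal sketch. The paper does not specify its exact operators or belief representation, so the propositional encoding, the Hamming distance, and the `merge` function below are illustrative assumptions drawn from the standard belief-merging literature (the Delta-Sigma / Delta-Max family), not the authors' implementation.

```python
from itertools import product

def hamming(w, v):
    """Hamming distance between two truth assignments (tuples of 0/1)."""
    return sum(a != b for a, b in zip(w, v))

def dist_to_belief(w, models):
    """Distance from world w to a belief base = min distance to any model of the base."""
    return min(hamming(w, m) for m in models)

def merge(bases, n_vars, aggregate=sum):
    """Distance-based merging: return the worlds minimizing the aggregated
    distance to all belief bases (Delta-Sigma when aggregate=sum, which favours
    majority views; Delta-Max when aggregate=max, which is more egalitarian)."""
    worlds = list(product([0, 1], repeat=n_vars))
    scores = {w: aggregate(dist_to_belief(w, b) for b in bases) for w in worlds}
    best = min(scores.values())
    return [w for w, s in scores.items() if s == best]

# Three sources over two propositions, e.g. "affordable" and "durable":
# source 1 asserts both, source 2 only the first, source 3 neither.
merged = merge([[(1, 1)], [(1, 0)], [(0, 0)]], n_vars=2)
```

In this toy run the merged outcome is the compromise world `(1, 0)`, which minimizes total disagreement across the three sources; in the paper's pipeline, an LLM would then verbalise the merged belief set rather than perform the aggregation itself.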
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z) - Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models [0.0]
Large language models (LLMs) achieve strong average performance yet remain unreliable at the instance level. We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner. The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors.
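The feature-mapping step this summary describes can be sketched minimally. The paper's actual features use semantic embeddings and clustering statistics; the sketch below substitutes a simple token-overlap (Jaccard) similarity as a stand-in, and the feature names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Token-overlap similarity; a stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def consensus_features(responses):
    """Map a set of model responses to agreement features that a supervised
    meta-learner could consume (illustrative feature set, not the paper's)."""
    sims = [jaccard(a, b) for a, b in combinations(responses, 2)]
    return {
        "mean_pairwise_sim": mean(sims) if sims else 1.0,
        "min_pairwise_sim": min(sims) if sims else 1.0,
        "n_responses": len(responses),
    }
```

A downstream classifier trained on such features can learn when cross-model agreement signals a trustworthy answer, which is the gist of the consensus engine.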
arXiv Detail & Related papers (2026-01-12T06:27:06Z) - Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations [70.94563079082751]
E-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. We propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue.
arXiv Detail & Related papers (2025-08-25T15:38:56Z) - CORG: Generating Answers from Complex, Interrelated Contexts [57.213304718157985]
In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups.
arXiv Detail & Related papers (2025-04-25T02:40:48Z) - Discourse-Driven Evaluation: Unveiling Factual Inconsistency in Long Document Summarization [7.218054628599005]
We study factual inconsistency errors and connect them with a line of discourse analysis. We propose a framework that decomposes long texts into discourse-inspired chunks.
arXiv Detail & Related papers (2025-02-10T06:30:15Z) - Context-Aware Hierarchical Merging for Long Document Summarization [56.96619074316232]
We propose different approaches to enrich hierarchical merging with context from the source document. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines.
arXiv Detail & Related papers (2025-02-03T01:14:31Z) - SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding [52.98133831401225]
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence.
We propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo.
We introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries.
arXiv Detail & Related papers (2024-07-06T16:08:17Z) - Nutribullets Hybrid: Multi-document Health Summarization [36.95954983680022]
We present a method for generating comparative summaries that highlights similarities and contradictions in input documents.
Our framework leads to more faithful, relevant and aggregation-sensitive summarization -- while being equally fluent.
arXiv Detail & Related papers (2021-04-08T01:44:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.