Toward Faithful and Complete Answer Construction from a Single Document
- URL: http://arxiv.org/abs/2602.06103v1
- Date: Thu, 05 Feb 2026 18:22:08 GMT
- Title: Toward Faithful and Complete Answer Construction from a Single Document
- Authors: Zhaoyang Chen, Cody Fleming
- Abstract summary: We present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration.
- Score: 1.0742675209112622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, directly applied LLMs lack systematic mechanisms to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which fundamentally conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design enables consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29%, respectively, with a corresponding 31% gain in F1-score. This effectively breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, while also mitigating generation truncation caused by length limitations. At the same time, we emphasize that EVE exhibits performance saturation due to the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.
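The extraction-validation-enumeration pattern described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: all function names, the sentence-level candidate unit, and the substring-based grounding check are assumptions made for clarity.

```python
# Illustrative sketch of an extraction -> validation -> enumeration pipeline.
# All names and logic are hypothetical stand-ins for the stages EVE describes.

def extract_candidates(document: str) -> list[str]:
    """Stage 1 (extraction): pull candidate answer units from the source.
    Here a 'unit' is simply a sentence split on periods."""
    return [s.strip() for s in document.split(".") if s.strip()]

def validate(candidate: str, document: str) -> bool:
    """Stage 2 (validation): keep only candidates literally grounded in the
    document, guarding faithfulness (no unsupported content)."""
    return candidate in document

def enumerate_answers(document: str) -> list[str]:
    """Stage 3 (enumeration): emit every validated candidate, guarding
    completeness, rather than stopping at one high-probability continuation."""
    return [c for c in extract_candidates(document) if validate(c, document)]

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """The metrics the abstract reports: precision penalizes unsupported
    content, recall penalizes omissions, F1 is their harmonic mean."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

doc = "EVE extracts candidates. EVE validates them. EVE enumerates answers."
answers = enumerate_answers(doc)
p, r, f1 = precision_recall_f1(set(answers),
                               {"EVE extracts candidates", "EVE validates them"})
```

Because validation is a hard grounding check rather than a decoding preference, precision and recall can improve together, which is the trade-off-breaking behavior the abstract claims.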
Related papers
- Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration [48.19579266939883]
Diffusion large language models (dLLMs) have attracted significant attention for their ability to enhance diversity, controllability, and parallelism. We propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs.
arXiv Detail & Related papers (2026-03-03T08:58:20Z) - Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars [17.13122301190815]
We present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
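The lookahead-then-verify idea in this blurb can be illustrated with a toy constraint. The sketch below uses a balanced-parentheses check in place of a real context-free grammar, and ranked per-position token proposals in place of actual dLLM distributions; all names and details are assumptions for illustration, not LAVE's implementation.

```python
# Hypothetical lookahead-then-verify loop: at each position, take the
# highest-ranked token proposal whose extended prefix is still viable
# under the constraint; otherwise fall back to the next proposal.

def prefix_viable(prefix: str) -> bool:
    """Toy grammar check: a parenthesis string prefix is viable as long
    as close-parens never outnumber open-parens."""
    depth = 0
    for ch in prefix:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def lookahead_then_verify(parallel_proposals: list[list[str]]) -> str:
    """parallel_proposals[i] holds candidate tokens for position i, most
    confident first (standing in for a dLLM's parallel predictions)."""
    prefix = ""
    for proposals in parallel_proposals:
        for token in proposals:
            if prefix_viable(prefix + token):  # verify before committing
                prefix += token
                break
        else:
            raise ValueError("no viable continuation at this position")
    return prefix

# ")" is the top proposal at position 0 but is rejected by the verifier,
# so decoding falls back to "(" and the output stays well-formed.
out = lookahead_then_verify([[")", "("], ["(", ")"], [")"], [")"]])
```

Verifying candidates against the constraint before committing is what lets syntactic correctness improve without regenerating whole sequences.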
arXiv Detail & Related papers (2026-01-31T08:58:15Z) - VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning. We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
arXiv Detail & Related papers (2026-01-19T07:21:19Z) - Thinking Before Constraining: A Unified Decoding Framework for Large Language Models [1.2468700211588883]
We propose a simple approach that combines the advantages of both natural and structured generation. Our method preserves the expressive power of natural language reasoning while ensuring the reliability of structured outputs.
arXiv Detail & Related papers (2026-01-12T13:25:28Z) - VIGOR+: Iterative Confounder Generation and Validation via LLM-CEVAE Feedback Loop [14.309475903975441]
Recent advances leverage Large Language Models to generate plausible hidden confounders based on domain knowledge. We propose VIGOR+, a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.
arXiv Detail & Related papers (2025-12-22T12:48:29Z) - Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models. But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming [8.52533297070733]
EVALOOOP is an assessment framework that evaluates robustness from a self-consistency perspective. We evaluate 96 popular large language models (LLMs) on the MBPP Plus benchmark. EVALOOOP induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops.
arXiv Detail & Related papers (2025-05-18T01:02:33Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution [61.80716438091887]
GenDiE (Generate, Discriminate, Evolve) is a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. By treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches. Experiments on ASQA (in-domain LFQA) and ConFiQA datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness.
arXiv Detail & Related papers (2025-03-03T16:08:33Z) - Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. They can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z) - Rethinking Uncertainty Estimation in Natural Language Generation [6.3398383724486544]
Large Language Models (LLMs) are increasingly employed in real-world applications. Uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. We propose G-NLL, which has the advantage of being obtained using only a single output sequence.
arXiv Detail & Related papers (2024-12-19T18:51:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.