Related papers: LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

URL: http://arxiv.org/abs/2508.15085v1
Date: Wed, 20 Aug 2025 21:41:42 GMT
Title: LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
Authors: MohamamdJavad Ardestani, Ehsan Kamalloo, Davood Rafiei,
Abstract summary: LongRecall is a three-stage recall evaluation framework.<n>It decomposes answers into self-contained facts, narrows plausible candidate matches through lexical and semantic filtering, and verifies alignment.<n>We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges.
Score: 14.211177885010029
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LongRecall. The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.

Related papers

Reasoning-Augmented Representations for Multimodal Retrieval [27.4146940988752]
Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision.<n>We argue this brittleness is often data-specified: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress.<n>We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval.
arXiv Detail & Related papers (2026-02-06T19:01:54Z)
Hybrid Retrieval-Augmented Generation for Robust Multilingual Document Question Answering [0.3376269351435395]
Large-scale digitization initiatives have unlocked massive collections of historical newspapers.<n>We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents.
arXiv Detail & Related papers (2025-12-14T13:57:05Z)
Stress Testing Factual Consistency Metrics for Long-Document Summarization [36.761145124360944]
We systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization.<n>We probe metric robustness through seven factuality-preserving perturbations applied to summaries.<n>Our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims.
arXiv Detail & Related papers (2025-11-10T23:24:25Z)
Reasoning-enhanced Query Understanding through Decomposition and Interpretation [87.56450566014625]
ReDI is a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation.<n>We compiled a large-scale dataset of real-world complex queries from a major search engine.<n> Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms.
arXiv Detail & Related papers (2025-09-08T10:58:42Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces $textbfLongBioBench, a benchmark for evaluating long-context language models.<n>We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results.<n>Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning [9.719614935865906]
This paper investigates Large Language Models (LLMs) reasoning ability over code snippets within large repositories.<n>We differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does)<n>Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context.
arXiv Detail & Related papers (2025-05-19T16:56:31Z)
MRAG: A Modular Retrieval Framework for Time-Sensitive Question Answering [3.117448929160824]
temporal relations and answering time-sensitive questions is a challenging task for question-answering systems powered by large language models (LLMs)<n>We introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels.<n>On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, leading to further improvements in final answer accuracy.
arXiv Detail & Related papers (2024-12-20T03:58:27Z)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
Atomic Fact Decomposition Helps Attributed Question Answering [30.75332718824254]
Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a question. This paper proposes an Atomic fact decomposition-based Retrieval and Editing framework. It decomposes the generated long-form answers into molecular clauses and atomic facts by the instruction-tuned LLMs.
arXiv Detail & Related papers (2024-10-22T05:25:54Z)
FABLES: Evaluating faithfulness and content selection in book-length summarization [55.50680057160788]
In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on book-length documents. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
arXiv Detail & Related papers (2024-04-01T17:33:38Z)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities. RCoT automatically detects and rectifys factual inconsistency in generated solutions. We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
arXiv Detail & Related papers (2023-05-19T08:02:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.