AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
- URL: http://arxiv.org/abs/2602.19127v1
- Date: Sun, 22 Feb 2026 10:55:21 GMT
- Title: AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
- Authors: Qijie You, Wenkai Yu, Wentao Zhang
- Abstract summary: We introduce AgenticRAGTracer, a benchmark for agent-based multi-hop reasoning. It is primarily constructed by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks.
- Score: 7.139631028105273
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rapid advancement of agent-based methods in recent years, Agentic RAG has become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is time-consuming and labor-intensive and limits scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset; for instance, GPT-5 attains merely 22.6% EM accuracy on its hardest portion. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains that either collapse prematurely or wander into over-extension. This highlights a critical inability to allocate reasoning steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
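To make the hop-aware evaluation described above concrete, the sketch below shows how a benchmark entry that carries intermediate hop-level questions enables per-hop exact-match scoring and a simple diagnosis of premature collapse versus over-extension. This is an illustrative sketch only: the schema fields (`final_question`, `hop_questions`), the EM normalization, and the collapse/over-extension rule are assumptions, not the actual AgenticRAGTracer schema or scoring code.

```python
# Illustrative sketch only: field names and the chain-shape rule are assumptions,
# not the actual AgenticRAGTracer schema or evaluation code.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (common EM normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


@dataclass
class HopAwareExample:
    """One benchmark entry with intermediate hop-level questions (hypothetical schema)."""
    final_question: str
    final_answer: str
    hop_questions: List[Tuple[str, str]] = field(default_factory=list)  # (sub_question, sub_answer)


def diagnose_chain(example: HopAwareExample,
                   agent_steps: List[Dict[str, str]],
                   final_prediction: str) -> dict:
    """Score the final answer, score each intermediate hop, and compare the
    number of steps the agent actually took against the gold hop count."""
    gold_hops = len(example.hop_questions)
    per_hop_em = [
        any(exact_match(step.get("answer", ""), gold_sub_answer) for step in agent_steps)
        for _, gold_sub_answer in example.hop_questions
    ]
    if len(agent_steps) < gold_hops:
        chain_shape = "premature_collapse"   # fewer steps than the task's logical structure
    elif len(agent_steps) > gold_hops:
        chain_shape = "over_extension"       # wandering beyond the required hops
    else:
        chain_shape = "aligned"
    return {
        "final_em": exact_match(final_prediction, example.final_answer),
        "per_hop_em": per_hop_em,
        "gold_hops": gold_hops,
        "agent_steps": len(agent_steps),
        "chain_shape": chain_shape,
    }


if __name__ == "__main__":
    ex = HopAwareExample(
        final_question="Which country is the director of Film X a citizen of?",
        final_answer="France",
        hop_questions=[
            ("Who directed Film X?", "Jane Doe"),
            ("Which country is Jane Doe a citizen of?", "France"),
        ],
    )
    steps = [{"query": "director of Film X", "answer": "Jane Doe"}]  # agent stopped one hop early
    print(diagnose_chain(ex, steps, final_prediction="Jane Doe"))
```

Under this hypothetical scheme, an agent that stops after one step on a two-hop question is flagged as a premature collapse even when its intermediate answer is correct, which is the kind of chain-shape signal the hop-aware diagnosis targets.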
Related papers
- From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning [12.024430772980502]
We introduce an agent-centric benchmarking paradigm for evaluating large language models. A teacher agent generates candidate problems, while an orchestrator agent rigorously verifies their validity and guards against adversarial attacks. If a student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants.
arXiv Detail & Related papers (2026-02-27T06:54:32Z) - PRInTS: Reward Modeling for Long-Horizon Information Seeking [74.14496236655911]
We introduce PRInTS, a generative PRM trained with dual capabilities. We show that PRInTS enhances the information-seeking abilities of open-source models as well as specialized agents.
arXiv Detail & Related papers (2025-11-24T17:09:43Z) - Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection [56.302586730134806]
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Current research predominantly focuses on unsupervised time-series anomaly detection. This paper challenges the premise that architectural complexity is the optimal path for TSAD.
arXiv Detail & Related papers (2025-11-20T08:32:49Z) - Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z) - UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios [63.67884284105684]
We introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules. Our experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores.
arXiv Detail & Related papers (2025-09-26T02:04:00Z) - An Empirical Study on Failures in Automated Issue Solving [12.571536148821144]
We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, on the automated issue-solving tasks of SWE-Bench-Verified. To move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks.
arXiv Detail & Related papers (2025-09-17T13:07:52Z) - GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation [5.002953635224383]
Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks. Current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. We propose GRADE, a novel evaluation framework that models task difficulty along two dimensions.
arXiv Detail & Related papers (2025-08-23T11:26:41Z) - Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety. We dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z) - Chain-of-Retrieval Augmented Generation [91.02950964802454]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state; a minimal sketch of this retrieve-and-reformulate loop appears after this list.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
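As a rough illustration of the chain-of-retrieval idea summarized in the last entry above (retrieve, reason, and reformulate the query step by step before answering), here is a minimal agentic loop. The `retrieve`, `reason_step`, and `answer` callables are hypothetical stubs, and the step budget and stopping rule are assumptions rather than CoRAG's actual method.

```python
# Minimal sketch of a step-by-step retrieve/reason/reformulate loop.
# retrieve(), reason_step(), and answer() are hypothetical stubs, not CoRAG's API.
from typing import Callable, List, Tuple


def chain_of_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],                                 # query -> retrieved passages
    reason_step: Callable[[str, List[str], List[str]], Tuple[str, str]],  # -> (note, next_query)
    answer: Callable[[str, List[str]], str],                              # question + notes -> final answer
    max_steps: int = 6,
) -> str:
    """Iteratively retrieve, record a reasoning note, and reformulate the query,
    then generate the final answer from the accumulated notes."""
    query = question
    notes: List[str] = []
    for _ in range(max_steps):
        passages = retrieve(query)
        note, next_query = reason_step(question, passages, notes)
        notes.append(note)
        if not next_query:          # the model signals it has gathered enough evidence
            break
        query = next_query          # dynamic reformulation based on the evolving state
    return answer(question, notes)
```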