Fugu-MT Paper Translation (Abstract): AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Paper overview: AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

arxiv url: http://arxiv.org/abs/2602.19127v1
Date: Sun, 22 Feb 2026 10:55:21 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.493252
Title: AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Title (reference translation): AgenticRAGTracer: A Hop-Aware Benchmark for Multi-Step Retrieval Reasoning in Agentic RAG
Authors: Qijie You, Wenkai Yu, Wentao Zhang
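Abstract summary: This paper introduces AgenticRAGTracer, a benchmark for agent-based multi-hop reasoning. It is constructed primarily by large language models and designed to support step-by-step validation. The benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks.
Reference score (attention level computed independently by the site): 7.139631028105273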
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is primarily constructed automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains -- either collapsing prematurely or wandering into over-extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing in traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.
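Abstract (reference translation): In recent years, agent-based methods have advanced rapidly, and Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as an important testbed for evaluating such capabilities. However, existing benchmarks typically provide only the final questions and answers and lack the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts finer-grained evaluation of model capabilities. Furthermore, most current benchmarks are constructed manually, which is time-consuming and labor-intensive and also limits scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is constructed primarily automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments show that even the best large language models perform poorly on our dataset. For example, GPT-5 attains only 22.6% EM accuracy on the hardest portion of the dataset. Hop-aware diagnosis shows that failures are driven primarily by distorted reasoning chains. This highlights an inability to allocate steps consistent with the task's logical structure and provides a diagnostic dimension missing from traditional evaluations. We believe this work will facilitate research on Agentic RAG and encourage further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.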
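
Illustrative sketch (not from the paper): the abstract reports EM (exact match) accuracy and attributes failures to reasoning chains that stop too early or run too long. The Python below is one minimal way such a hop-aware tally could be computed. Everything in it is an assumption made for illustration: the record keys (`prediction`, `gold`, `steps_taken`, `gold_hops`), the SQuAD-style answer normalization, and the rule that compares the number of steps an agent took with the annotated hop count. The benchmark's actual data schema and metric definitions are in the authors' repository and may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def classify_chain(steps_taken: int, gold_hops: int) -> str:
    """Hypothetical rule: compare the agent's step count with the annotated hops."""
    if steps_taken < gold_hops:
        return "premature_collapse"
    if steps_taken > gold_hops:
        return "over_extension"
    return "aligned"


def diagnose(records):
    """Return (EM accuracy, Counter of chain labels among the failed records).

    Each record is a dict with the hypothetical keys
    'prediction', 'gold', 'steps_taken', and 'gold_hops'.
    """
    records = list(records)
    if not records:
        raise ValueError("records must not be empty")
    correct = 0
    failures = Counter()
    for r in records:
        if exact_match(r["prediction"], r["gold"]):
            correct += 1
        else:
            failures[classify_chain(r["steps_taken"], r["gold_hops"])] += 1
    return correct / len(records), failures
```

Calling `diagnose` on a list of such records returns the overall EM accuracy together with a count of failed examples per chain label, which separates "wrong answer after too few steps" from "wrong answer after too many steps".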

