Fugu-MT Paper Translation (Abstract): Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Paper overview: Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

arxiv url: http://arxiv.org/abs/2605.05242v1
Date: Sun, 03 May 2026 19:13:11 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.303133
Title: Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Title (reference translation): Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search through Direct Corpus Interaction
Authors: Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang
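Abstract summary: We study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines.
Reference score (independently computed attention score): 127.64173950476702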
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. DCI thereby opens a broader interface-design space for agentic search.
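Abstract (reference translation): Whether lexical or semantic, modern retrieval systems expose a corpus through a fixed similarity interface and compress access into a single top-k retrieval step before reasoning. This abstraction is efficient, but in agentic search, exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are hard to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks worsen this limitation further, because they require the agent to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), in which an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and achieves high accuracy on BrowseComp-Plus and multi-hop QA without relying on a conventional semantic retriever. The results show that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus.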
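
As a rough illustration of the kind of operation the abstract describes (a conjunction of exact lexical clues followed by a local context check, run directly over raw files with no index or embedding model), here is a minimal Python sketch. It is not the authors' implementation. The function name `grep_conjunction`, the `*.txt` file layout, and the toy corpus in `demo` are all assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path


def grep_conjunction(corpus_dir, patterns, context=1):
    """Scan raw text files and return (file name, line number, snippet)
    for lines matching the first pattern, restricted to files in which
    every pattern occurs somewhere. Hypothetical helper for illustration."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Conjunction of clues: keep the file only if every pattern occurs.
        if not all(rx.search(text) for rx in compiled):
            continue
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if compiled[0].search(line):
                # Local context check: include the neighbouring lines.
                lo = max(0, i - context)
                hi = min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, "\n".join(lines[lo:hi])))
    return hits


def demo():
    # Toy two-file corpus, invented for this example.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text(
            "Marie Curie was born in Warsaw.\nShe won two Nobel Prizes.\n",
            encoding="utf-8",
        )
        (root / "b.txt").write_text(
            "Warsaw is the capital of Poland.\n", encoding="utf-8"
        )
        return grep_conjunction(root, [r"\bNobel\b", r"\bWarsaw\b"])


if __name__ == "__main__":
    for name, lineno, snippet in demo():
        print(f"{name}:{lineno}\n{snippet}")
```

Because every call re-reads the files, nothing needs rebuilding when the corpus changes, which is the property the abstract attributes to DCI. The cost is a full scan per query, so this sketch only suits small local corpora.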

Related papers