Fugu-MT Paper Translation (Summary): Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Paper summary: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

arxiv url: http://arxiv.org/abs/2510.07414v1
Date: Wed, 08 Oct 2025 18:12:10 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:14.652233
Title: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Title (reference translation): Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
Abstract summary: Long-context large language models (LLMs) perform well on "needle-in-a-haystack" (NIAH) benchmarks. However, such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. The authors instantiate haystack engineering through HaystackCraft, a new NIAH benchmark built on the English Wikipedia hyperlink network.
Reference score (independently computed attention score): 40.38390243268607
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
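Abstract (reference translation): Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary for building noisy long contexts that faithfully capture key real-world factors, such as distraction from heterogeneous biased retrievers and cascading errors in agentic workflows, in order to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark with multi-hop questions built on the English Wikipedia hyperlink network. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. It further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, in which models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking both improves retrieval effectiveness and mitigates the more harmful distractors; (2) in agentic tests, even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.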
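
The abstract's claim that the retrieval strategy shapes both which distractors enter a haystack and how they are ordered can be illustrated with a toy sketch. This is not the benchmark's implementation: the lexical-overlap scorer merely stands in for a sparse retriever, personalized PageRank is assumed here as one possible form of graph-based reranking, and every name, document, and link below is hypothetical.

```python
from collections import Counter

def sparse_score(query, doc):
    """Toy lexical-overlap score standing in for a sparse retriever (assumption)."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum(min(q[t], d[t]) for t in q))

def personalized_pagerank(links, seeds, alpha=0.85, iters=50):
    """Power iteration for personalized PageRank over a directed link graph."""
    nodes = sorted(links)
    total = sum(seeds.get(n, 0.0) for n in nodes)
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            if links[n]:
                share = alpha * rank[n] / len(links[n])
                for m in links[n]:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank

def build_haystack(query, docs, links, needles, budget_words, rerank=False):
    """Return document names in haystack order, keeping needles and
    filling the remaining word budget with the top-ranked distractors."""
    scores = {name: sparse_score(query, text) for name, text in docs.items()}
    if rerank and any(scores.values()):
        # Retriever scores act as the restart distribution (an assumed design).
        scores = personalized_pagerank(links, scores)
    ranked = sorted(docs, key=lambda n: (-scores[n], n))
    chosen = set(needles)
    used = sum(len(docs[n].split()) for n in chosen)
    for name in ranked:
        cost = len(docs[name].split())
        if name not in chosen and used + cost <= budget_words:
            chosen.add(name)
            used += cost
    return [n for n in ranked if n in chosen]

# Hypothetical four-page corpus with hyperlinks; "A" holds the answer (the needle).
docs = {
    "A": "paris is the capital of france",
    "B": "france borders spain and italy",
    "C": "the capital of italy is rome",
    "D": "bananas are yellow fruit",
}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
query = "capital of france"

print(build_haystack(query, docs, links, ["A"], budget_words=12))
print(build_haystack(query, docs, links, ["A"], budget_words=12, rerank=True))
```

In this toy corpus the two calls select different distractors and place the needle at different positions, which is the kind of retriever-dependent haystack composition and ordering the abstract describes.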

Related papers