Fugu-MT Paper Translation (Abstract): RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures

Paper summary: RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures

arxiv url: http://arxiv.org/abs/2601.18924v1
Date: Mon, 26 Jan 2026 19:52:42 GMT
Status: Translation complete
System update date: 2026-01-28 15:26:51.049439
Title: RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures
Title (reference translation): RIFT: A Reordered Instruction Following Testbed for Evaluating Instruction Following in Singular Multistep Prompt Structures
Authors: Andrew Jaffe, Noah Reicin, Jinho D. Choi
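Abstract summary: The paper introduces RIFT (Reordered Instruction Following Testbed), which assesses instruction following by disentangling structure from content. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions. The results reveal structural sensitivity as a fundamental limitation of current architectures.
Reference score (independently computed attention score): 7.812349915277743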
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT, Reordered Instruction Following Testbed, to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question-answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non-sequential traversal. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation in current architectures, with direct implications for applications requiring non-sequential control flow such as workflow automation and multi-agent systems.
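Abstract (reference translation): Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain the flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the effect of prompt topology on performance. We introduce RIFT (Reordered Instruction Following Testbed), which assesses instruction following by disentangling structure from content. LLMs are tested on two prompt structures: linear prompts and jumping prompts. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), showing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation of current architectures, with direct implications for applications that require non-sequential control flow, such as workflow automation and multi-agent systems.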
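The contrast between the two prompt structures described in the abstract can be illustrated with a minimal sketch. This is not the authors' prompt format, which the abstract does not specify: the function names `build_linear_prompt` and `build_jumping_prompt`, the "Step N" labels, and the "go to" wording are all hypothetical choices made only for illustration. The point is that both prompts carry identical questions, while the jumping variant requires the reader to visit steps in a non-sequential order.

```python
import random


def build_linear_prompt(questions):
    """Hypothetical linear prompt: steps are answered in the order listed."""
    lines = ["Answer the steps below in order, starting at Step 1."]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Step {i}: {question}")
    return "\n".join(lines)


def build_jumping_prompt(questions, seed=0):
    """Hypothetical jumping prompt: same questions, same positions on the
    page, but each step names the step to visit next."""
    n = len(questions)
    rng = random.Random(seed)

    # Visiting order is a shuffled permutation of the step labels 1..n.
    order = list(range(1, n + 1))
    rng.shuffle(order)

    # Map each step to its successor in the visiting order.
    next_step = dict(zip(order, order[1:]))

    lines = [f"Start at Step {order[0]} and follow the 'go to' directions."]
    for i, question in enumerate(questions, start=1):
        target = next_step.get(i)
        if target is not None:
            direction = f"then go to Step {target}."
        else:
            direction = "then stop."  # last step in the visiting order
        lines.append(f"Step {i}: {question} Answer, {direction}")
    return "\n".join(lines), order


if __name__ == "__main__":
    # Placeholder questions, not items from the RIFT dataset.
    sample = ["First placeholder question?", "Second placeholder question?",
              "Third placeholder question?"]
    print(build_linear_prompt(sample))
    jumping, visiting_order = build_jumping_prompt(sample, seed=1)
    print(jumping)
    print("Expected visiting order:", visiting_order)
```

Returning the visiting order alongside the jumping prompt makes it possible to check whether a model's answers arrive in the intended sequence, which is one way an instruction-order violation could be detected under these assumptions.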

Related papers