Fugu-MT Paper Translation (Summary): When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

Paper summary: When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

arxiv url: http://arxiv.org/abs/2603.22816v1
Date: Tue, 24 Mar 2026 05:38:13 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.320639
Title: When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning
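Title (reference translation): When AI shows its work, is it actually working? Step-level evaluation reveals that frontier language models frequently bypass their own reasoning
Authors: Abhinaba Basu, Pavan Chakraborty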
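Abstract summary: Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already decided? The paper introduces step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes.
Reference score (independently computed attention metric): 0.0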
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access -- no model weights -- and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
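Abstract (reference translation): Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or are they decorative narratives generated after the model has already decided? A medical AI writes: "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome." If the eosinophilia observation is removed, does the diagnosis change? For most frontier models, the answer is no. The paper introduces step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access (no model weights) and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) on sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern, MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%), but both shortcut other tasks. Faithfulness is model-specific and task-specific. On the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than for faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, and per-model, per-domain evaluation is essential.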
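The step-level test described in the abstract (remove one reasoning sentence at a time and check whether the answer changes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `step_necessity`, `step_sufficiency`, and `answer_fn` are hypothetical, and the toy models stand in for a real API call that would feed the edited reasoning back to a language model. The abstract does not specify the exact prompting or metric definitions, so the two fractions below are assumptions about how "necessity" and "any single step alone recovers the answer" might be scored.

```python
from typing import Callable, Sequence

# Hypothetical interface: given a question and a list of reasoning
# sentences, return the model's final answer as a string.
AnswerFn = Callable[[str, Sequence[str]], str]


def step_necessity(question: str, steps: Sequence[str], answer_fn: AnswerFn) -> float:
    """Fraction of steps whose removal changes the answer (assumed definition)."""
    if not steps:
        return 0.0
    baseline = answer_fn(question, steps)
    changed = 0
    for i in range(len(steps)):
        # Drop exactly one reasoning sentence and re-query.
        ablated = list(steps[:i]) + list(steps[i + 1:])
        if answer_fn(question, ablated) != baseline:
            changed += 1
    return changed / len(steps)


def step_sufficiency(question: str, steps: Sequence[str], answer_fn: AnswerFn) -> float:
    """Fraction of steps that, kept alone, still recover the baseline answer."""
    if not steps:
        return 0.0
    baseline = answer_fn(question, steps)
    recovered = sum(1 for s in steps if answer_fn(question, [s]) == baseline)
    return recovered / len(steps)


# Toy stand-ins for a real model (illustrative only).
def decorative_model(question: str, steps: Sequence[str]) -> str:
    # Ignores its reasoning entirely: the answer is already decided.
    return "B"


def faithful_model(question: str, steps: Sequence[str]) -> str:
    # Answer depends on every step: sums the number ending each step.
    return str(sum(int(s.split()[-1]) for s in steps))


if __name__ == "__main__":
    steps = ["add 2", "add 3", "add 5"]
    for name, model in [("decorative", decorative_model), ("faithful", faithful_model)]:
        print(name,
              "necessity:", step_necessity("q", steps, model),
              "sufficiency:", step_sufficiency("q", steps, model))
```

In a real run, `answer_fn` would wrap one API request per ablation, so a trace with n steps costs roughly n + 1 calls for necessity and another n for sufficiency.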

Related papers