Fugu-MT paper translation (abstract): From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Paper overview: From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

arxiv url: http://arxiv.org/abs/2509.23196v1
Date: Sat, 27 Sep 2025 08:59:31 GMT
Status: translation complete
Last updated in system: 2025-09-30 22:32:19.098853
Title: From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs
Title (reference translation): From Harm to Help: Turning In-Context Demos into Assets for Reasoning LMs
Authors: Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi
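Abstract summary: The paper revisits the paradox that reasoning LLMs often perform worse with few-shot CoT than with direct answering, using high-quality DeepSeek-R1 traces as demonstrations. It finds that adding more exemplars consistently degrades accuracy, even when the demonstrations are optimal. It introduces Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights.
Reference score (independently computed attention score): 58.02809208460186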
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via an insight-refine-solve framework.
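Abstract (reference translation): Recent reasoning LLMs (RLMs), particularly those trained with verifier-based reinforcement learning, often do worse with few-shot CoT than with direct answering. Using high-quality reasoning traces from DeepSeek-R1 as demonstrations, we revisit this paradox and find that adding more exemplars consistently lowers accuracy, even when the demonstrations are optimal. A detailed analysis reveals two mechanisms behind the decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as identical to the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to the target question. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace. Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform direct answering and test-time scaling baselines on both open- and closed-source models. Even for GPT models, on AIME'25 GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be used effectively through an insight-refine-solve framework.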
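
The abstract describes I2S only at a high level: extract insights from the demonstrations, derive a target-specific reasoning trace, and optionally self-refine (I2S+). The following is a minimal Python sketch of that sequence, not the authors' implementation. The `LLM` callable, all function names, the prompt wording, and the `refine_rounds` parameter are assumptions made for illustration; the paper's actual prompts and stopping rules are not given in this listing.

```python
from typing import Callable, List, Tuple

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]
Demo = Tuple[str, str]  # (question, reasoning trace)


def extract_insights(llm: LLM, demos: List[Demo]) -> str:
    """Step 1: turn demonstrations into explicit, reusable insights."""
    shown = "\n\n".join(f"Question: {q}\nReasoning: {r}" for q, r in demos)
    prompt = (
        "Summarize the reusable problem-solving insights in these examples. "
        "Do not copy their intermediate steps.\n\n" + shown
    )
    return llm(prompt)


def solve_with_insights(llm: LLM, insights: str, target: str) -> str:
    """Step 2: derive a target-specific reasoning trace from the insights.

    The raw demonstrations are deliberately not included in this prompt,
    so the model cannot copy their steps verbatim (hypothetical design choice).
    """
    prompt = (
        f"Insights:\n{insights}\n\n"
        f"Using these insights where they apply, solve:\n{target}"
    )
    return llm(prompt)


def refine(llm: LLM, target: str, trace: str) -> str:
    """Optional step (I2S+): self-refine the trace for coherence and correctness."""
    prompt = (
        f"Question:\n{target}\n\nDraft reasoning:\n{trace}\n\n"
        "Check the draft for errors and return a corrected solution."
    )
    return llm(prompt)


def insight_to_solve(
    llm: LLM, demos: List[Demo], target: str, refine_rounds: int = 0
) -> str:
    """Run the sequential procedure; refine_rounds > 0 approximates I2S+."""
    insights = extract_insights(llm, demos)
    trace = solve_with_insights(llm, insights, target)
    for _ in range(refine_rounds):
        trace = refine(llm, target, trace)
    return trace
```

With `refine_rounds=0` the sketch makes two model calls per target question, and each refinement round adds one more. Passing only the extracted insights to the solving step is one plausible way to address the "semantic misguidance" failure mode named in the abstract; whether the original method does exactly this is not stated here.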


Related papers