Fugu-MT Paper Translation (Abstract): Learning from Peers in Reasoning Models

Paper overview: Learning from Peers in Reasoning Models

arxiv url: http://arxiv.org/abs/2505.07787v1
Date: Mon, 12 May 2025 17:39:56 GMT
Status: Translation complete
Last updated in system: 2025-05-13 20:21:49.522925
Title: Learning from Peers in Reasoning Models
Authors: Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang
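Abstract summary: Large Reasoning Models (LRMs) can self-correct even when they make mistakes in their reasoning paths. Our study shows that when the reasoning process begins with a short but poor start, the model has difficulty recovering. Inspired by psychological findings that peer interaction promotes self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon.
Reference score (independently computed attention score): 30.683206230784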
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction through timely peer insights, showing strong error tolerance and the ability to handle varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .
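The summarize-and-share loop described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the names `Path`, `summarize`, `route`, and `leap_style_decode`, the "last few tokens" summarizer, and the "first top_k peers" router are all hypothetical stand-ins, and `step_fn` replaces a real language model call. The sketch only shows the control flow: paths generate in parallel, and at a fixed token interval each path summarizes itself and receives summaries from its peers.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Path:
    """One reasoning path: generated tokens plus peer summaries received so far."""
    tokens: List[str] = field(default_factory=list)
    inbox: List[str] = field(default_factory=list)


def summarize(path: Path, max_tokens: int = 4) -> str:
    """Hypothetical summarizer: the last few tokens stand in for a real summary."""
    return " ".join(path.tokens[-max_tokens:])


def route(summaries: List[str], receiver: int, top_k: int) -> List[str]:
    """Hypothetical router: forward up to top_k summaries from the other paths."""
    peers = [s for i, s in enumerate(summaries) if i != receiver]
    return peers[:top_k]


def leap_style_decode(
    step_fn: Callable[[int, Path], str],
    num_paths: int,
    total_tokens: int,
    interval: int,
    top_k: int = 2,
) -> List[Path]:
    """Generate num_paths paths in lockstep, exchanging summaries every `interval` tokens."""
    paths = [Path() for _ in range(num_paths)]
    for t in range(1, total_tokens + 1):
        # Each path produces one more token; step_fn may read path.inbox.
        for i, path in enumerate(paths):
            path.tokens.append(step_fn(i, path))
        if t % interval == 0:
            # Collect all summaries first so every path sees the same round.
            summaries = [summarize(p) for p in paths]
            for i, path in enumerate(paths):
                path.inbox.extend(route(summaries, i, top_k))
    return paths


def toy_step(index: int, path: Path) -> str:
    """Stand-in for a model call: emit a label identifying path and position."""
    return f"p{index}t{len(path.tokens)}"


demo_paths = leap_style_decode(toy_step, num_paths=3, total_tokens=4, interval=2, top_k=1)
for i, p in enumerate(demo_paths):
    print(i, p.tokens, p.inbox)
```

In this sketch the peer summaries are kept in a separate `inbox` rather than spliced into the generated text; how the summaries are actually injected into each path's context, and how peers are selected, are design choices the abstract does not specify.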


Related papers