Fugu-MT Paper Translation (Abstract): ARM: Adaptive Reasoning Model

Paper overview: ARM: Adaptive Reasoning Model

arxiv url: http://arxiv.org/abs/2505.20258v1
Date: Mon, 26 May 2025 17:38:50 GMT
Status: Translation complete
Last updated in system: 2025-05-28 14:37:20.354514
Title: ARM: Adaptive Reasoning Model
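Title (reference translation): ARM: Adaptive Reasoning Model
Authors: Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao
Abstract summary: This paper proposes Adaptive Reasoning Model (ARM), a reasoning model that can adaptively select an appropriate format based on the task at hand. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to a model that relies solely on Long CoT.
Reference score (attention score computed independently by the site): 36.53965139929349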
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
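Abstract (reference translation): Large reasoning models show strong performance on complex tasks, but they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem, that is, excessive and unnecessary reasoning, which may be mitigated by human intervention to control the token budget but still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model that can adaptively select an appropriate reasoning format based on the task at hand. These formats include three efficient ones (Direct Answer, Short CoT, and Code) as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) that addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to a model that relies solely on Long CoT. Furthermore, it not only improves inference efficiency through reduced token generation but also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens and is ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance at the cost of higher token usage.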
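
The Consensus-Guided Mode described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The format labels, the `generate` callable, and the rule that "agreement" means all three efficient formats return the same answer are assumptions made for illustration; the abstract only states that the three efficient outputs are aggregated and that Long CoT is used when they disagree.

```python
from typing import Callable, Tuple

# Hypothetical format labels; the abstract names the formats but not their tokens.
EFFICIENT_FORMATS = ("direct", "short_cot", "code")
LONG_COT = "long_cot"


def consensus_guided(
    question: str,
    generate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (answer, source) for one question.

    `generate(question, fmt)` is an assumed callable that returns the final
    answer produced by the model in the given reasoning format.
    """
    # Query the three efficient formats first.
    answers = [generate(question, fmt).strip() for fmt in EFFICIENT_FORMATS]

    # Assumption: consensus means all three answers are identical.
    if len(set(answers)) == 1:
        return answers[0], "consensus"

    # Disagreement: fall back to the more expensive Long CoT format.
    return generate(question, LONG_COT).strip(), LONG_COT


def toy_generate(question: str, fmt: str) -> str:
    """Stand-in for a real model call; the answers are hard-coded."""
    canned = {"direct": "5", "short_cot": "4", "code": "4", "long_cot": "4"}
    return canned[fmt]


if __name__ == "__main__":
    answer, source = consensus_guided("What is 2 + 2?", toy_generate)
    print(answer, source)
```

With the toy generator above, the efficient formats disagree, so the sketch falls back to Long CoT. This reflects the trade-off the abstract mentions: the mode spends more tokens in exchange for prioritizing performance.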

Related papers