Fugu-MT paper translation (summary): Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

Paper summary: Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

arxiv url: http://arxiv.org/abs/2604.02007v1
Date: Thu, 02 Apr 2026 13:10:27 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.802207
Title: Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Authors: Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin,
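Abstract summary: We present Apriel-Reasoner, trained on Apriel-Base with a fully reproducible multi-domain RL post-training recipe. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics. Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench.
Reference score (independently computed attention score): 49.3394732265528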
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
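Illustrative sketch (not from the paper): the abstract names two mechanisms, adaptive domain sampling and a difficulty-aware length penalty, but does not give their formulas. The Python below is one plausible reading under stated assumptions. The function names, the deficit-based selection rule, the use of the per-prompt pass rate as a difficulty proxy, and the values of `alpha` and `max_len` are all hypothetical and chosen only for illustration.

```python
def pick_domain(target_ratios, consumed):
    """Pick the domain whose consumed share lags its target share the most.

    ASSUMPTION: a greedy deficit rule; the paper's actual sampler may differ.
    target_ratios: dict domain -> desired fraction (sums to 1).
    consumed: dict domain -> number of samples already used for training.
    """
    total = sum(consumed.get(d, 0) for d in target_ratios)

    def deficit(domain):
        share = consumed.get(domain, 0) / total if total else 0.0
        return target_ratios[domain] - share

    return max(target_ratios, key=deficit)


def length_penalized_rewards(correct, lengths, max_len=16_384, alpha=0.2):
    """Reward a group of rollouts for ONE prompt with a difficulty-scaled length penalty.

    ASSUMPTION: difficulty is estimated from the group's pass rate, which the
    rollouts already provide, so no extra model calls are needed. Easy prompts
    (high pass rate) get a stronger penalty on long correct answers; hard
    prompts (low pass rate) get a weaker one. Incorrect answers get 0.
    """
    pass_rate = sum(correct) / len(correct)
    rewards = []
    for ok, n_tokens in zip(correct, lengths):
        if not ok:
            rewards.append(0.0)
            continue
        # Penalty grows with relative length and with how easy the prompt is.
        penalty = alpha * pass_rate * min(n_tokens, max_len) / max_len
        rewards.append(1.0 - penalty)
    return rewards


if __name__ == "__main__":
    targets = {"math": 0.5, "code": 0.3, "puzzles": 0.2}
    consumed = {d: 0 for d in targets}
    for _ in range(1000):
        consumed[pick_domain(targets, consumed)] += 1
    print(consumed)

    # Easy prompt (both rollouts correct): the longer answer is rewarded less.
    print(length_penalized_rewards([True, True], [8_192, 16_384]))
    # Hard prompt (one of two correct): the same length is penalized less.
    print(length_penalized_rewards([True, False], [16_384, 100]))
```

In this sketch the sampler counts samples actually consumed rather than prompts issued, which is one way to keep the mix on target when domains differ in rollout length and filtering rate.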
