Fugu-MT paper translation (summary): MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Paper summary: MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

arxiv url: http://arxiv.org/abs/2507.14683v1
Date: Sat, 19 Jul 2025 16:21:23 GMT
Status: Translation complete
Last updated in system: 2025-07-22 20:51:32.002648
Title: MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
Title (reference translation): MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
Authors: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing
Abstract summary: MiroMind-M1 is a set of fully open-source RLMs built on the Qwen-2.5 backbone. The models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems.
Reference score (attention score computed independently by this site): 74.04867639197445
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
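Abstract (reference translation): Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark because it requires precise multi-step logic and abstract reasoning that can generalize to other tasks. Closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, but their proprietary nature limits transparency and reproducibility. Many open-source projects aim to close this gap, but most lack sufficient openness: they omit critical resources such as datasets and detailed training configurations, which hinders reproducibility. The MiroMind-M1 series is a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, the models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization. On the AIME24, AIME25, and MATH benchmarks, our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models. To facilitate reproducibility, we release the models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B), the datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K), and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.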
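The abstract describes the RLVR stage as combining length-progressive training with an adaptive repetition penalty but gives no formulas. The following Python sketch is only an illustration of that general idea and is not the authors' implementation. All of it is assumed for illustration: the function names, the n-gram definition of repetition, the penalty weight, the treatment of over-budget responses, and the stage lengths in `STAGE_MAX_LEN`.

```python
from collections import Counter

# Hypothetical length-progressive schedule: each stage allows longer responses.
# These values are placeholders, not settings reported by the paper.
STAGE_MAX_LEN = [8, 16, 32]


def repetition_fraction(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram in the response."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def shaped_reward(is_correct, tokens, max_len, penalty_weight=0.5, n=4):
    """Verifiable 0/1 reward minus a penalty that grows with repetition.

    Assumption: a response longer than the current stage budget gets reward 0.
    """
    if len(tokens) > max_len:
        return 0.0
    base = 1.0 if is_correct else 0.0
    return base - penalty_weight * repetition_fraction(tokens, n)


if __name__ == "__main__":
    looping = ["a", "b"] * 3          # highly repetitive response
    clean = list("abcdef")            # no repeated bigrams
    for stage, max_len in enumerate(STAGE_MAX_LEN):
        print(stage, max_len,
              shaped_reward(True, clean, max_len, n=2),
              shaped_reward(True, looping, max_len, n=2))
```

In this sketch the penalty is "adaptive" only in the loose sense that it scales with how much of the response is repeated; the method in the paper may define it differently.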

Related papers

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
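We propose RL-PLUS, a novel hybrid-policy optimization approach for large language models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of the approach.
Paper reference translation (metadata) (2025-07-31T23:55:29Z)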
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning [4.963955559863751]
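MMAT-1M is the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. The dataset is constructed with a novel four-stage data engine. Fine-tuning open-source multimodal models on MMAT-1M yields significant performance gains.
Paper reference translation (metadata) (2025-07-29T15:39:14Z)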
CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs [2.238122883754112]
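CIMR is a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR achieves 91.5% accuracy, surpassing state-of-the-art models such as GPT-4V, LLaVA-1.5, MiniGPT-4, and InstructBLIP.
Paper reference translation (metadata) (2025-07-22T18:39:18Z)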
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning [28.92744927199283]
ReVisual-R1 achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.
Paper reference translation (metadata) (2025-06-04T17:51:08Z)
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [55.82649731348012]
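We introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model that uses rule-based reinforcement learning with online filtering and a two-stage training strategy to enhance training stability.
Paper reference translation (metadata) (2025-03-10T14:23:12Z)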
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [24.45348222168512]
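We propose Vision-R1, an MLLM for improving multimodal reasoning capability. Our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark.
Paper reference translation (metadata) (2025-03-09T20:06:45Z)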
START: Self-taught Reasoner with Tools [51.38785489790888]
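We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START can perform complex computations, self-checking, exploration of diverse methods, and self-debugging. It significantly outperforms its base model QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
Paper reference translation (metadata) (2025-03-06T17:11:51Z)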
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [23.80647785460245]
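Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of large language models. We take a first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. URSA is a three-stage Unfolding Multimodal Process-Supervision Aided Training framework.
Paper reference translation (metadata) (2025-01-08T18:49:41Z)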
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
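Multimodal large language models (MLLMs) show great potential on multimodal tasks. Existing instruction-tuning datasets provide only phrase-level answers without intermediate rationales. We therefore propose a scalable and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset.
Paper reference translation (metadata) (2024-12-06T18:14:24Z)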
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
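We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset. We also develop a simple method called Mixed Preference Optimization (MPO) that improves multimodal CoT performance.
Paper reference translation (metadata) (2024-11-15T18:59:27Z)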

The related papers list is generated automatically from the titles and abstracts of papers on this site.

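This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).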