Fugu-MT Paper Translation (Summary): Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning

Paper summary: Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning

arxiv url: http://arxiv.org/abs/2505.12432v1
Date: Sun, 18 May 2025 14:08:03 GMT
Status: Translation complete
Last updated in system: 2025-05-20 14:57:11.231653
Title: Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning
Title (reference translation): Observe-R1: Unlocking the Reasoning Abilities of MLLMs through Dynamic Progressive Reinforcement Learning
Authors: Zirun Guo, Minjie Hong, Tao Jin
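Abstract summary: We propose Observe-R1, a new framework aimed at improving the reasoning abilities of multimodal large language models (MLLMs). We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. Experiments with Qwen2.5-VL-3B and Qwen2.5-VL-7B on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms larger reasoning models on both reasoning and general benchmarks.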
Reference score (independently computed attention score): 3.364797975300393
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspiration from human learning progression, from simple to complex and from easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.
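Abstract (reference translation): Reinforcement learning (RL) has shown promise for improving the reasoning abilities of large language models (LLMs). However, the specific challenges of applying RL to multimodal data and formats have not yet been fully explored. In this work, we propose Observe-R1, a new framework aimed at improving the reasoning abilities of multimodal large language models (MLLMs). Drawing inspiration from the progression of human learning, from simple to complex and from easy to difficult, we propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To address multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images. In addition, we introduce a bonus reward system that favors concise, correct answers within a length constraint, together with a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater influence on training. Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms larger reasoning models on both reasoning and general benchmarks and achieves superior clarity and conciseness in its reasoning chains. Ablation studies validate the effectiveness of our strategies and highlight the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.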

Related papers