Fugu-MT Paper Translation (Summary): Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Paper summary: Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

arxiv url: http://arxiv.org/abs/2605.27960v1
Date: Wed, 27 May 2026 04:54:07 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.75488
Title: Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
Title (reference translation): Mags-RL: Equipping Multimodal LLMs with a Magnifying Glass via Agentic Reinforcement Learning for Complex Scene Reasoning
Authors: Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang
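Abstract summary: Mags-RL is an agentic reinforcement learning framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution, fine-grained inspection. In the first round, the model generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations. In the second round, it invokes the super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer.
Reference score (independently computed attention score): 37.40202473385897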
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as only 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.
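Abstract (reference translation): Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning ability in complex scenarios (for example, high object density and heavy background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues such as bounding boxes, which require extra annotations. Moreover, the resulting low-resolution crops often miss the fine-grained details that MLLMs need for accurate reasoning. We therefore propose Mags-RL, an agentic reinforcement learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution, fine-grained inspection. The model reasons in two rounds: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes the super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a new curriculum learning strategy that enables data-efficient RL training, requiring as few as 40 training samples to reach reasonable performance. Experiments on subsets of VSR, TallyQA, and GQA show superior performance over recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.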
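The two-round procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code has not been released): the names `FirstRound`, `crop`, `upscale_nearest`, and `two_round_answer` are hypothetical, the `propose` and `verify` callables stand in for the MLLM, and nearest-neighbour upscaling is only a placeholder for a real super-resolution model. Images are plain nested lists so the example needs no external libraries.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


@dataclass
class FirstRound:
    """Output of round one: a draft rationale plus regions worth a closer look."""
    rationale: str
    regions: List[Box]


def crop(image: Image, box: Box) -> Image:
    """Cut the box out of a non-empty image, clamping it to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def upscale_nearest(image: Image, factor: int) -> Image:
    """Placeholder for the super-resolution agent: nearest-neighbour upscaling."""
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def two_round_answer(
    image: Image,
    question: str,
    propose: Callable[[Image, str], FirstRound],          # hypothetical round-one model call
    verify: Callable[[Image, List[Image], str, str], str],  # hypothetical round-two model call
    factor: int = 2,
) -> str:
    # Round 1: draft a rationale and pick regions of interest without annotations.
    first = propose(image, question)
    # Round 2: crop and upscale each region, then re-check the draft rationale.
    zoomed = [upscale_nearest(crop(image, box), factor) for box in first.regions]
    return verify(image, zoomed, question, first.rationale)
```

In practice `propose` and `verify` would wrap calls to the same MLLM with different prompts, and `upscale_nearest` would be swapped for a learned super-resolution model; the sketch only fixes the data flow between the two rounds.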

Related papers