Fugu-MT Paper Translation (Abstract): MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning

Paper overview: MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.21788v1
Date: Fri, 26 Sep 2025 02:43:22 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.13556
Title: MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
Title (reference translation): MIRG-RL: Multi-Image Reasoning and Grounding via Reinforcement Learning
Authors: Lihao Zheng, Jiawei Chen, Xintian Shen, Hao Ma, Tao Wei
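Abstract summary: Current Large Visual Language Models (LVLMs) face two critical challenges: a lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. The paper proposes a unified framework, Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, it uses a two-stage training paradigm that combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization.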
Reference score (independently computed attention score): 10.049259114211663
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.
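Abstract (reference translation): Multi-image reasoning and grounding require understanding complex cross-image relationships at both the object level and the image level. Current Large Visual Language Models (LVLMs) face two critical challenges: a lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these, we propose a unified framework, Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, a two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization to develop multi-image reasoning capabilities progressively. We also propose a method for constructing trajectory data that integrates object-level and image-level annotation information, and use it to generate a lightweight reasoning-enhanced dataset. To resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments show that MIRG-RL achieves state-of-the-art (SOTA) performance on multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks and exceeding the previous best method by 1%. The code and dataset are released at https://github.com/ZEUS2035/MIRG-RL.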
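
Note: the abstract mentions an image-aware RL policy with dual reward functions for objects and images but does not give their form. The following is only a minimal sketch of what such a dual reward could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the `Prediction` type, the `(x1, y1, x2, y2)` box format, the use of IoU as the object-level signal, the equal weights, and the choice to grant the object reward only when the correct image is selected.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Prediction:
    image_index: int  # which image of the input set the answer points to
    box: Box          # bounding box inside that image


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dual_reward(pred: Prediction, target: Prediction,
                w_image: float = 0.5, w_object: float = 0.5) -> float:
    """Weighted sum of an image-level and an object-level reward (assumed form)."""
    # Image-level term: 1 if the model picked the correct image, else 0.
    image_reward = 1.0 if pred.image_index == target.image_index else 0.0
    # Object-level term: box overlap, counted only when the image is correct
    # (an assumption, so that a well-placed box in the wrong image earns nothing).
    object_reward = iou(pred.box, target.box) if image_reward else 0.0
    return w_image * image_reward + w_object * object_reward
```

For example, with a target of `Prediction(1, (0, 0, 10, 10))`, a prediction in the wrong image scores 0 under this sketch regardless of its box, while a prediction in the correct image is scored by how well its box overlaps the target.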

Related papers