Fugu-MT Paper Translation (Summary): MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

Paper summary: MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.01520v1
Date: Sat, 02 May 2026 16:21:31 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.818092
Title: MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
Authors: Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang, Kun Wang, Qingsong Wen, Yilei Shao
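Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. The authors introduce MIRL, a decoupled framework that addresses two limitations of prevailing RLVR methods by using the mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. Experiments on six vision-language reasoning benchmarks show that MIRL reaches 70.22% average accuracy.
Reference score (attention score computed independently by this site): 46.54440573184562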
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.
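The pre-screening step described in the abstract can be illustrated with a minimal Python sketch. It assumes each pre-sampled visual description already has an MI score from some estimator, which the abstract does not specify. The function name `select_and_fork`, the example scores, and the choice of two forks per kept description are all hypothetical. Two forks is used only because it yields 12 complete trajectories, which matches the stated "25% fewer" than 16. This is not the authors' implementation.

```python
from typing import List, Tuple


def select_and_fork(
    mi_scores: List[float],
    top_k: int = 6,
    forks_per_description: int = 2,  # assumption, not stated in the abstract
) -> List[Tuple[int, int]]:
    """Return (description_index, fork_id) pairs for trajectories to complete."""
    if top_k > len(mi_scores):
        raise ValueError("top_k cannot exceed the number of pre-samples")
    # Rank pre-sampled descriptions by MI score, highest first.
    ranked = sorted(range(len(mi_scores)), key=lambda i: mi_scores[i], reverse=True)
    kept = ranked[:top_k]
    # Fork each kept description into several reasoning continuations.
    return [(idx, fork) for idx in kept for fork in range(forks_per_description)]


# Hypothetical MI scores for 10 pre-sampled visual descriptions.
scores = [0.12, 0.85, 0.40, 0.91, 0.33, 0.77, 0.05, 0.64, 0.58, 0.29]
plan = select_and_fork(scores)
kept_indices = sorted({idx for idx, _ in plan})
savings = 1 - len(plan) / 16  # relative to sampling 16 complete trajectories
print(kept_indices, len(plan), f"{savings:.0%}")
# prints: [1, 2, 3, 5, 7, 8] 12 25%
```

Under these assumptions, the four lowest-scoring descriptions are discarded before any reasoning tokens are spent on them, which is the budget saving the abstract attributes to MI-based pre-screening.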

Related papers