Fugu-MT paper translation (abstract): LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Paper abstract: LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

arxiv url: http://arxiv.org/abs/2509.00676v1
Date: Sun, 31 Aug 2025 03:08:02 GMT
Status: translation complete
Last updated in system: 2025-09-04 15:17:03.335557
Title: LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Title (reference translation): LLaVA-Critic-R1: Your Critic Model Is Secretly a Strong Policy Model
Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
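Abstract summary: The paper shows that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. The results indicate that a single unified model can excel at both evaluation and generation.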
Reference score (independently computed attention score): 99.71684530652942
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
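Abstract (reference translation): In vision-language modeling, critic models are usually trained to evaluate outputs (assigning scalar scores or pairwise preferences) rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose reorganizing preference-labeled critic datasets into verifiable training signals and performing reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic that optimizes preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model: it matches or surpasses specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further improves policy performance without sacrificing critic quality and achieves SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. These results show that RL training on critic data can produce a unified model that excels at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.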
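
Illustrative sketch (not from the paper): the abstract describes turning preference-labeled critic data into verifiable training signals for reinforcement learning. One plausible reading is a rule-based reward that checks whether the model's stated preference matches the dataset label. The Python below is a minimal sketch under that reading. The answer format (`\boxed{A}` or `\boxed{B}`), the function names, and the binary 0/1 reward are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re
from typing import Optional


def parse_preference(response: str) -> Optional[str]:
    """Extract the final 'A' or 'B' choice from a critic response.

    Assumption (hypothetical format): the model wraps its final choice
    as \\boxed{A} or \\boxed{B}. The last such occurrence is used.
    """
    matches = re.findall(r"\\boxed\{([AB])\}", response)
    return matches[-1] if matches else None


def preference_reward(response: str, preferred: str) -> float:
    """Return 1.0 if the parsed choice equals the preference label, else 0.0.

    A response with no parsable choice also receives 0.0.
    """
    choice = parse_preference(response)
    return 1.0 if choice == preferred else 0.0
```

Because the reward depends only on the final choice and the stored label, it can be checked automatically, which is the sense in which such a signal would be "verifiable"; the reasoning text that precedes the choice is left unconstrained in this sketch.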

Related papers