LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
- URL: http://arxiv.org/abs/2509.00676v1
- Date: Sun, 31 Aug 2025 03:08:02 GMT
- Title: LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
- Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang,
- Abstract summary: We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
- Score: 99.71684530652942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
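The abstract describes two mechanisms: converting preference-labeled critic data into a verifiable reward for RL, and using the resulting critic ability for test-time self-critique. The following is a minimal Python sketch of both ideas under assumed interfaces; the prompt wording, verdict format, function names, and the generic `model.generate()` call are illustrative assumptions, not the paper's released code.

```python
import re

def build_critic_prompt(question: str, response_a: str, response_b: str) -> str:
    """Judging prompt (assumed format): compare two candidates and end with a verdict tag."""
    return (
        f"Question: {question}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Compare the two responses, reason step by step, and finish with "
        "'Verdict: A' or 'Verdict: B'."
    )

def verifiable_reward(critique: str, preferred: str) -> float:
    """Binary reward for RL: 1.0 iff the model's verdict matches the human preference label."""
    match = re.search(r"Verdict:\s*([AB])", critique)
    if match is None:
        return 0.0  # unparseable judgments earn no reward
    return 1.0 if match.group(1) == preferred else 0.0

def self_critique_select(model, question: str, candidates: list[str]) -> str:
    """Test-time self-critique (sketch): the same model judges its own candidates
    in a single-elimination pass; `model.generate` is a hypothetical interface."""
    best = candidates[0]
    for challenger in candidates[1:]:
        critique = model.generate(build_critic_prompt(question, best, challenger))
        verdict = re.search(r"Verdict:\s*([AB])", critique)
        if verdict and verdict.group(1) == "B":
            best = challenger
    return best

# Example: scoring one critique rollout during RL training.
print(verifiable_reward("A is fluent, but B is grounded in the image. Verdict: B", "B"))  # -> 1.0
```

Because the only supervision is the scalar from `verifiable_reward`, the model is free to generate full critiques, which is consistent with the abstract's claim that generation ability is retained while preference accuracy is optimized.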
Related papers
- Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.60378227969643]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly (a minimal sketch of this loop appears after this list). Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
arXiv Detail & Related papers (2025-10-28T11:37:01Z)
- SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. We show that SPARK achieves significant performance gains for multiple LLM and LVLM models across reasoning, reward-model, and general benchmarks.
arXiv Detail & Related papers (2025-09-26T17:50:12Z)
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
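The two-player paradigm summarized in the Critique-RL entry above (actor answers, critic gives feedback, actor refines) can be sketched as follows; the `generate()` interface and prompt wording are assumptions for illustration, not that paper's implementation.

```python
def actor_critic_refine(actor, critic, question: str, rounds: int = 1) -> str:
    """One critique-and-refine cycle per round: the actor answers, the critic
    reviews, and the actor revises. `actor`/`critic` are assumed to expose a
    hypothetical .generate(prompt) -> str method."""
    response = actor.generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        feedback = critic.generate(
            f"Question: {question}\n\nCandidate answer: {response}\n\n"
            "Point out any errors and suggest concrete fixes."
        )
        response = actor.generate(
            f"Question: {question}\n\nPrevious answer: {response}\n\n"
            f"Reviewer feedback: {feedback}\n\nWrite an improved answer."
        )
    return response
```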