PhyCritic: Multimodal Critic Models for Physical AI
- URL: http://arxiv.org/abs/2602.11124v1
- Date: Wed, 11 Feb 2026 18:35:39 GMT
- Title: PhyCritic: Multimodal Critic Models for Physical AI
- Authors: Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
- Abstract summary: We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline. We show that PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
- Score: 101.37916322714041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
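The self-referential critic step described in the abstract lends itself to a short illustration. The Python sketch below shows one plausible reading of that two-pass protocol: the critic first answers the question itself, forming an internal reference, and only then issues a pairwise verdict conditioned on that reference. The `generate` helper, the prompt wording, and all function names are assumptions made for illustration, not PhyCritic's actual interface.

```python
# Hedged sketch of self-referential critic judging. `generate` is a
# hypothetical stand-in for any multimodal generation call; prompts and
# names are illustrative assumptions, not PhyCritic's real interface.

def generate(model, image, prompt: str) -> str:
    """Placeholder: run one multimodal generation step, return text."""
    raise NotImplementedError  # wire this up to your own VLM inference

def self_referential_judge(critic, image, question: str,
                           response_a: str, response_b: str):
    # Pass 1: the critic answers the question itself; this answer acts
    # as an internal reference grounded in its own physical reasoning.
    reference = generate(
        critic, image,
        f"Question: {question}\nFirst, answer the question yourself."
    )
    # Pass 2: the critic judges both candidates against its reference
    # and emits a pairwise preference plus a short justification.
    verdict = generate(
        critic, image,
        f"Question: {question}\n"
        f"Your reference answer: {reference}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is more physically correct? "
        "Reply with 'A' or 'B' and a brief justification."
    )
    return reference, verdict
```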
Related papers
- Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach [0.0]
This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. We show how adjusting for rater severity produces corrected estimates of summary quality. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
arXiv Detail & Related papers (2026-02-26T03:35:36Z)
- Counterfactual Self-Questioning for Stable Policy Optimization in Language Models [0.0]
We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. Experiments on multiple mathematical reasoning benchmarks show that counterfactual self-questioning improves accuracy and training stability, particularly for smaller models.
arXiv Detail & Related papers (2025-12-31T09:10:37Z)
- LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Training Language Model to Critique for Better Refinement [58.73039433159486]
We introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment (see the sketch after this list).
arXiv Detail & Related papers (2025-06-27T12:10:57Z)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs). Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning [46.411313304605564]
Critic-V is a framework inspired by the Actor-Critic paradigm to boost the reasoning capability of vision-language models (VLMs). The Reasoner generates reasoning paths based on visual and textual inputs, and the Critic provides constructive critique to refine these paths. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks.
arXiv Detail & Related papers (2024-11-27T10:28:57Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
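As referenced in the RCO entry above, the refinement-signal idea can be sketched in a few lines: rather than collecting preference labels over critiques, each critique is scored by how much the actor's refined response improves under it. The helper names (`actor_generate`, `quality`) and the reward definition below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of refinement-based critique rewards (RCO-style).
# `actor_generate` and `quality` are hypothetical callables supplied by
# the caller; the reward definition is an assumption for exposition.

def refinement_reward(question: str, response: str, critique: str,
                      actor_generate, quality) -> float:
    # The actor revises its draft response under the given critique...
    refined = actor_generate(
        f"Question: {question}\nDraft: {response}\n"
        f"Critique: {critique}\nRevise the draft accordingly."
    )
    # ...and the critique earns the improvement it induces, so no
    # direct preference judgments over critiques are required.
    return quality(question, refined) - quality(question, response)
```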