Revisiting the Learning Objectives of Vision-Language Reward Models
- URL: http://arxiv.org/abs/2512.20675v1
- Date: Sat, 20 Dec 2025 19:50:36 GMT
- Title: Revisiting the Learning Objectives of Vision-Language Reward Models
- Authors: Simon Roy, Samuel Barbeau, Giovanni Beltrame, Christian Desrosiers, Nicolas Thome
- Abstract summary: Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision-language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. We evaluate recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments.
- Score: 19.768973349254285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision-language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with the ground-truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvement in recent approaches could be attributed to differences in data and architectures.
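The listing includes no code, so as a rough illustration of the simple objective the abstract highlights, here is a minimal PyTorch sketch of a triplet loss for adapting a CLIP-style VLM into a reward model: the anchor is the embedded task description, the positive an embedded frame near task success, and the negative an embedded earlier frame. The function names, shapes, and sampling scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: triplet loss for turning a CLIP-style VLM into a reward model.
# Anchor = goal-text embedding, positive = frame near task success,
# negative = earlier frame. Names and sampling are illustrative assumptions.
import torch
import torch.nn.functional as F

def triplet_reward_loss(text_emb, pos_frame_emb, neg_frame_emb, margin=0.2):
    """Pull goal text and near-success frames together; push early frames away.

    All inputs are (batch, dim) embeddings from a shared VLM embedding space.
    """
    # L2-normalize so distances live on the unit hypersphere, as in CLIP.
    a = F.normalize(text_emb, dim=-1)
    p = F.normalize(pos_frame_emb, dim=-1)
    n = F.normalize(neg_frame_emb, dim=-1)
    # Standard triplet margin loss on Euclidean distance between embeddings.
    return F.triplet_margin_loss(a, p, n, margin=margin)

def vlm_reward(text_emb, frame_emb):
    """At policy-learning time, the dense reward is cosine similarity to the goal text."""
    a = F.normalize(text_emb, dim=-1)
    f = F.normalize(frame_emb, dim=-1)
    return (a * f).sum(dim=-1)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for VLM outputs.
    torch.manual_seed(0)
    text = torch.randn(8, 512)
    pos, neg = torch.randn(8, 512), torch.randn(8, 512)
    print("loss:", triplet_reward_loss(text, pos, neg).item())
    print("rewards:", vlm_reward(text, pos)[:3])
```

Under the paper's protocol, a reward model trained this way would then be judged by its consistency with the ground-truth Meta-World reward and its correlation with expert progress along demonstration trajectories.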
Related papers
- Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact [3.437656066916039]
LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models on out-of-distribution tasks drawn from the teaching and learning of schoolchildren.
arXiv Detail & Related papers (2026-03-01T03:05:46Z) - Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation [46.38008143057758]
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify these challenges.
arXiv Detail & Related papers (2026-02-10T00:45:24Z) - Reward Models are Metrics in a Trench Coat [8.100404050572996]
We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues.
arXiv Detail & Related papers (2025-10-03T17:59:44Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Shortcut Learning Susceptibility in Vision Classifiers [11.599035626374409]
Shortcut learning occurs when machine learning models exploit spurious correlations in data instead of capturing meaningful features. This study introduces deliberate shortcuts into the dataset that are correlated with class labels both positionally and via intensity. We evaluate susceptibility to shortcut learning across different learning rates.
arXiv Detail & Related papers (2025-02-13T10:25:52Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric for gauging the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
MIR is informative about training data selection, training strategy scheduling, and model architecture design for obtaining better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Learning to Unlearn for Robust Machine Unlearning [6.488418950340473]
We introduce a novel Learning-to-Unlearn (LTU) framework to optimize the unlearning process.
LTU includes a meta-optimization scheme that helps models effectively preserve generalizable knowledge.
We also introduce a Gradient Harmonization strategy to align the optimization trajectories for remembering and forgetting; a rough sketch of this style of gradient alignment appears after this list.
arXiv Detail & Related papers (2024-07-15T07:36:00Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Learning Objective-Specific Active Learning Strategies with Attentive Neural Processes [72.75421975804132]
Learning Active Learning (LAL) proposes learning the active learning strategy itself, allowing it to adapt to the given setting.
We propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem.
Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives.
arXiv Detail & Related papers (2023-09-11T14:16:37Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation learning from observation frames policy learning in a way similar to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
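The Gradient Harmonization strategy mentioned in the Learning-to-Unlearn (LTU) entry above is only named in this summary, and its exact formulation is not given here. As a loose illustration of the general idea of reconciling two conflicting objectives (remembering vs. forgetting), below is a minimal PyTorch sketch of PCGrad-style gradient projection, a standard gradient-surgery technique; the function name and the choice of PCGrad are assumptions, not the paper's method.

```python
# Hypothetical illustration: PCGrad-style gradient surgery, one common way to
# "harmonize" two conflicting objectives such as remembering vs. forgetting.
# This is NOT the LTU paper's exact method; it only sketches the general idea.
import torch

def harmonize(g_remember: torch.Tensor, g_forget: torch.Tensor) -> torch.Tensor:
    """Project the forgetting gradient off the remembering gradient when they conflict.

    Both inputs are flattened parameter gradients of the same shape; the
    return value is a combined update direction.
    """
    dot = torch.dot(g_forget, g_remember)
    if dot < 0:  # gradients conflict (angle greater than 90 degrees)
        # Remove the component of g_forget that directly opposes g_remember.
        g_forget = g_forget - dot / (g_remember.norm() ** 2 + 1e-12) * g_remember
    return g_remember + g_forget

if __name__ == "__main__":
    # Toy usage with random stand-in gradients.
    torch.manual_seed(0)
    g_r, g_f = torch.randn(10), torch.randn(10)
    print(harmonize(g_r, g_f))
```

When the two gradients already agree (non-negative inner product), the update is simply their sum; the projection only activates on conflict, which is what lets both objectives make progress.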