Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2502.01616v1
- Date: Mon, 03 Feb 2025 18:50:15 GMT
- Title: Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning
- Authors: Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury
- Abstract summary: We introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods.
- Score: 17.59802090014789
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2x fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
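To make the labeling pipeline concrete, here is a minimal sketch of the VLM-labels-plus-uncertainty-filtering idea described in the abstract. The callables `vlm_prefer` and `ask_human` and the agreement-based uncertainty test are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vlm_preference(vlm_prefer, seg_a, seg_b, task_prompt, n_queries=5):
    """Query a VLM several times for a preference between two trajectory
    segments and flag the pair as uncertain when the votes disagree.

    `vlm_prefer` is a hypothetical callable returning 1 if the VLM prefers
    seg_a and 0 if it prefers seg_b (e.g. by scoring rendered frames
    against the task description).
    """
    votes = [vlm_prefer(seg_a, seg_b, task_prompt) for _ in range(n_queries)]
    p_a = float(np.mean(votes))            # empirical preference for seg_a
    uncertain = min(p_a, 1.0 - p_a) > 0.2  # split votes -> low confidence
    return p_a, uncertain

def build_preference_dataset(pairs, vlm_prefer, ask_human, task_prompt):
    """Label segment pairs with the VLM, falling back to a human
    annotator (`ask_human`) only for the uncertain ones."""
    dataset = []
    for seg_a, seg_b in pairs:
        p_a, uncertain = vlm_preference(vlm_prefer, seg_a, seg_b, task_prompt)
        label = ask_human(seg_a, seg_b) if uncertain else round(p_a)
        dataset.append((seg_a, seg_b, label))
    return dataset
```

In a preference-based RL setting, the resulting labels would then train a reward model with the usual Bradley-Terry preference loss; that step, and the paper's inverse-dynamics adaptation of the VLM, are omitted here.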
Related papers
- HF4Rec: Human-Like Feedback-Driven Optimization Framework for Explainable Recommendation [8.532115411106068]
We propose a novel human-like feedback-driven optimization framework for explainable recommendations.
This framework employs a dynamic interactive optimization mechanism to meet human-centered explainability requirements without incurring high labor costs.
In particular, we propose to utilize large language models (LLMs) as human simulators to predict human-like feedback for guiding the learning process.
arXiv Detail & Related papers (2025-04-19T02:46:10Z)
- When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making [15.397582422113627]
Embodied decision-making is fundamental for AI agents operating in real-world environments.
In this study, we evaluate open-sourced Visual Language Models (VLMs) on multimodal human-centered decision-making tasks.
arXiv Detail & Related papers (2025-03-21T09:25:23Z)
- From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models [58.16075709485292]
CAREVL is a novel method for preference reward modeling that reliably uses both high- and low-confidence data.
CAREVL achieves performance improvements over traditional distillation-based methods on the VL-RewardBench and MLLM-as-a-Judge benchmarks.
arXiv Detail & Related papers (2025-03-08T16:13:18Z)
- OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [80.36831779302148]
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities.
This paper introduces OmniAlign-V, a dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats.
Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment.
arXiv Detail & Related papers (2025-02-25T18:05:14Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models [44.82179903133343]
Large Language Models (LLMs) act as implicit optimizers for Vision-Language Models (VLMs). Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts. We evaluate GLOV on 16 diverse datasets using two families of VLMs.
arXiv Detail & Related papers (2024-10-08T15:55:40Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs). We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- Improve Temporal Awareness of LLMs for Sequential Recommendation [61.723928508200196]
Large language models (LLMs) have demonstrated impressive zero-shot abilities in solving a wide range of general-purpose tasks.
However, LLMs fall short in recognizing and utilizing temporal information, resulting in poor performance on tasks that require an understanding of sequential data.
We propose three prompting strategies to exploit temporal information within historical interactions for LLM-based sequential recommendation.
arXiv Detail & Related papers (2024-05-05T00:21:26Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning; a generic sketch of this regularization idea appears after this list.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z)
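As referenced in the LINVIT entry above, one generic way to use an LLM as a regularizer in value-based RL is a KL-regularized Bellman backup that pulls the learned behavior toward an LLM-suggested action prior. The sketch below is an illustrative assumption (hypothetical `llm_prior_fn`, tabular Q-table), not the LINVIT algorithm itself.

```python
import numpy as np

def kl_regularized_backup(q_next, llm_prior, lam):
    """KL-regularized state value: V(s) = lam * log sum_a prior(a|s) * exp(Q(s,a)/lam).

    This is the closed-form solution of max_pi E_pi[Q] - lam * KL(pi || prior),
    so a larger `lam` pulls the greedy policy toward the LLM's suggestions.
    (Written for clarity; a real implementation would use a log-sum-exp trick.)
    q_next: (num_actions,) action values at the next state.
    llm_prior: (num_actions,) probabilities the LLM assigns to each action.
    """
    return lam * np.log(np.sum(llm_prior * np.exp(q_next / lam)))

def q_update(Q, s, a, r, s_next, llm_prior_fn, gamma=0.99, alpha=0.1, lam=1.0):
    """One tabular TD update whose bootstrap target uses the LLM-regularized value."""
    v_next = kl_regularized_backup(Q[s_next], llm_prior_fn(s_next), lam)
    Q[s, a] += alpha * (r + gamma * v_next - Q[s, a])
    return Q
```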