Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- URL: http://arxiv.org/abs/2310.12921v2
- Date: Thu, 14 Mar 2024 12:16:00 GMT
- Title: Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
- Abstract summary: Reinforcement learning (RL) requires either manually specifying a reward function, or learning a reward model from a large amount of human feedback.
We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language.
- Score: 12.628697648945298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out parts of the CLIP embedding space irrelevant to distinguish between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
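The recipe described in the abstract is concrete enough to sketch: render a frame of the environment, embed the frame and the task prompt with CLIP, and use their cosine similarity as the reward, optionally regularized with a second "baseline" prompt. Below is a minimal sketch of that idea in Python; the prompts, the ViT-B/32 checkpoint, the `alpha` mixing parameter, and the exact form of the goal-baseline projection are illustrative assumptions rather than the authors' implementation, which uses larger CLIP backbones.
```python
# Minimal sketch of a CLIP-based zero-shot reward model (VLM-RM) with an
# optional goal-baseline projection, as described in the abstract.
# Prompts, checkpoint, and projection details are illustrative assumptions.
import clip   # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_text(prompt: str) -> torch.Tensor:
    with torch.no_grad():
        z = model.encode_text(clip.tokenize([prompt]).to(device)).float()
    return z / z.norm(dim=-1, keepdim=True)

def embed_image(frame: Image.Image) -> torch.Tensor:
    with torch.no_grad():
        z = model.encode_image(preprocess(frame).unsqueeze(0).to(device)).float()
    return z / z.norm(dim=-1, keepdim=True)

goal = embed_text("a humanoid robot kneeling")   # single-sentence task prompt
baseline = embed_text("a humanoid robot")        # optional baseline prompt

def vlm_reward(frame: Image.Image, alpha: float = 0.5) -> float:
    """Cosine similarity between the rendered frame and the goal prompt,
    with the frame embedding partially projected onto the goal-baseline
    direction (alpha = 0 disables the projection)."""
    s = embed_image(frame)
    d = goal - baseline
    d = d / d.norm(dim=-1, keepdim=True)
    # Keep the component of s along the goal-baseline direction; shrink the rest.
    s = alpha * (s @ d.T) * d + (1.0 - alpha) * s
    s = s / s.norm(dim=-1, keepdim=True)
    return float((s @ goal.T).item())
```
In an RL loop, `vlm_reward` would be evaluated on frames rendered from the MuJoCo humanoid and passed to a standard RL algorithm in place of a hand-specified reward function.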
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - Are Bigger Encoders Always Better in Vision Large Models? [21.797332686137203]
Multimodal large language models (MLLMs) have shown strong potential in real-world applications.
The scaling trend of vision language models (VLMs) under the current mainstream paradigm has not been extensively studied.
We conduct experiments on the pretraining stage of MLLMs using different encoder sizes and large language model (LLM) sizes.
arXiv Detail & Related papers (2024-08-01T15:05:42Z) - FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning [18.60627708199452]
We investigate how to leverage pre-trained visual-language models (VLMs) for online Reinforcement Learning (RL).
We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks.
We introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL).
arXiv Detail & Related papers (2024-06-02T07:20:08Z) - An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z) - Code as Reward: Empowering Reinforcement Learning with VLMs [37.862999288331906]
We propose a framework named Code as Reward (VLM-CaR) to produce dense reward functions from pre-trained Vision-Language Models.
VLM-CaR significantly reduces the computational burden of querying the VLM directly.
We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments.
arXiv Detail & Related papers (2024-02-07T11:27:45Z) - Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z) - Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [76.410400238974]
We propose test-time adaptation (TTA) with feedback to rectify the model output and prevent the model from becoming blindly confident.
A CLIP model is adopted as the reward model during TTA and provides feedback for the VLM.
The proposed reinforcement learning with CLIP feedback (RLCF) framework is highly flexible and universal.
arXiv Detail & Related papers (2023-05-29T11:03:59Z) - Reinforcement Learning Friendly Vision-Language Model for Minecraft [31.863271032186038]
We propose a novel cross-modal contrastive learning framework, CLIP4MC.
We aim to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks.
We demonstrate that the proposed method achieves better performance on RL tasks compared with baselines.
arXiv Detail & Related papers (2023-03-19T05:20:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.