GoalLadder: Incremental Goal Discovery with Vision-Language Models
- URL: http://arxiv.org/abs/2506.16396v1
- Date: Thu, 19 Jun 2025 15:28:27 GMT
- Title: GoalLadder: Incremental Goal Discovery with Vision-Language Models
- Authors: Alexey Zakharov, Shimon Whiteson
- Abstract summary: We propose a novel method to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses it to rank potential goal states using an ELO-based rating system.
- Score: 38.35578010611503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that can learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, $\textbf{GoalLadder}$, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses it to rank potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, achieving an average final success rate of $\sim$95% compared to only $\sim$45% for the best competitor.
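The mechanism described in the abstract, pairwise VLM comparisons aggregated into an Elo-style rating over candidate goal states, with a dense reward given by the distance to the top-ranked goal in a learned embedding space, can be illustrated with a minimal sketch. The class and function names, the K-factor of 32, the 400-point scale, and the `embed` interface below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class GoalPool:
    """Illustrative Elo-style ranking of candidate goal states from pairwise VLM comparisons."""

    def __init__(self, k: float = 32.0, initial_rating: float = 1000.0):
        self.k = k                      # update step size (assumed value)
        self.initial_rating = initial_rating
        self.ratings = {}               # goal_id -> Elo rating
        self.states = {}                # goal_id -> candidate goal observation (e.g. an image)

    def add(self, goal_id, state):
        """Register a state the VLM flagged as an improvement in task progress."""
        self.states[goal_id] = state
        self.ratings.setdefault(goal_id, self.initial_rating)

    def update(self, winner_id, loser_id):
        """Fold in one pairwise comparison: the VLM judged `winner_id` closer to task completion."""
        r_w, r_l = self.ratings[winner_id], self.ratings[loser_id]
        expected_w = 1.0 / (1.0 + 10.0 ** ((r_l - r_w) / 400.0))
        self.ratings[winner_id] = r_w + self.k * (1.0 - expected_w)
        self.ratings[loser_id] = r_l - self.k * (1.0 - expected_w)
        # A single noisy judgement only nudges the ratings, so an occasional
        # wrong VLM answer rarely changes which goal is top-ranked.

    def top_goal(self):
        """Return the state currently ranked closest to task completion."""
        return self.states[max(self.ratings, key=self.ratings.get)]


def distance_reward(embed, observation, goal_state):
    """Dense reward: negative distance to the top-ranked goal in a learned embedding space."""
    z_obs, z_goal = embed(observation), embed(goal_state)
    return -float(np.linalg.norm(z_obs - z_goal))
```

As the abstract notes, the embedding network behind `embed` would be trained separately on unlabelled visual data; here it is left as an opaque callable.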
Related papers
- RLZero: Direct Policy Inference from Language Without In-Domain Supervision [40.046873614139464]
Natural language offers an intuitive alternative for instructing reinforcement learning agents. We present a new approach that uses a pretrained RL agent trained on unlabeled, offline interactions. We show that components of RL can be used to generate policies zero-shot from cross-embodied videos.
arXiv Detail & Related papers (2024-12-07T18:31:16Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language. This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs). We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z) - Visual Grounding for Object-Level Generalization in Reinforcement Learning [35.39214541324909]
Generalization is a pivotal challenge for agents following natural language instructions.
We leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning.
We show that our intrinsic reward significantly improves performance on challenging skill-learning tasks.
arXiv Detail & Related papers (2024-08-04T06:34:24Z) - Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs exists for training, and acquiring one cannot easily be crowdsourced.
arXiv Detail & Related papers (2024-04-06T13:25:00Z) - Yell At Your Robot: Improving On-the-Fly from Language Corrections [84.09578841663195]
We show that high-level policies can be readily supervised with human feedback in the form of language corrections.
This framework enables robots not only to rapidly adapt to real-time language feedback, but also to incorporate this feedback into an iterative training scheme.
arXiv Detail & Related papers (2024-03-19T17:08:24Z) - RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z) - Vision-Language Models as a Source of Rewards [68.52824755339806]
We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents.
We show how rewards for the visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents that can achieve those goals (a minimal sketch of such a reward appears after this list).
arXiv Detail & Related papers (2023-12-14T18:06:17Z) - Guiding Pretraining in Reinforcement Learning with Large Language Models [133.32146904055233]
We describe a method that uses background knowledge from text corpora to shape exploration.
This method, called ELLM, rewards an agent for achieving goals suggested by a language model.
By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop.
arXiv Detail & Related papers (2023-02-13T21:16:03Z)
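For the CLIP-as-reward approach summarised in the Vision-Language Models as a Source of Rewards entry above, a minimal sketch of deriving a scalar reward from image-text similarity is shown below. The Hugging Face checkpoint name and the function interface are assumptions for illustration, not that paper's actual setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the referenced paper may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_reward(frame: Image.Image, goal_text: str) -> float:
    """Reward: cosine similarity between an observation frame and a language goal."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())
```

In practice such a raw similarity is typically thresholded or normalised per task before being used as a sparse or dense reward signal.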