RLZero: Direct Policy Inference from Language Without In-Domain Supervision
- URL: http://arxiv.org/abs/2412.05718v2
- Date: Sun, 01 Jun 2025 15:15:39 GMT
- Title: RLZero: Direct Policy Inference from Language Without In-Domain Supervision
- Authors: Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
- Abstract summary: Natural language offers an intuitive alternative for instructing reinforcement learning agents. We present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions. We show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos.
- Score: 40.046873614139464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions--without task-specific supervision or labeled trajectories--to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.
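To make the imagine-project-imitate pipeline concrete, here is a minimal sketch in Python. Every interface in it (the video generative model, the projection step, the pretrained agent's feature map, and the closed-form inference) is a hypothetical placeholder for a component the abstract names but does not specify; this illustrates the control flow, not the authors' implementation.

```python
# Hypothetical sketch of the imagine -> project -> imitate pipeline.
# None of these interfaces are RLZero's actual API.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class PretrainedAgent:
    """Stands in for an agent pretrained with unsupervised RL.

    phi maps an observation to a feature vector. infer_policy recovers a
    task vector in closed form (no test-time gradient steps), here via a
    deliberately simple stand-in: the normalized mean target feature.
    """
    phi: Callable[[np.ndarray], np.ndarray]

    def infer_policy(self, target_features: np.ndarray) -> np.ndarray:
        z = target_features.mean(axis=0)
        return z / (np.linalg.norm(z) + 1e-8)  # conditions the frozen policy


def rlzero_sketch(instruction: str,
                  imagine: Callable[[str], List[np.ndarray]],
                  project: Callable[[np.ndarray], np.ndarray],
                  agent: PretrainedAgent) -> np.ndarray:
    frames = imagine(instruction)                # 1. imagine: video model
    projected = [project(f) for f in frames]     # 2. project: to target domain
    feats = np.stack([agent.phi(o) for o in projected])
    return agent.infer_policy(feats)             # 3. imitate: closed form
```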
Related papers
- GoalLadder: Incremental Goal Discovery with Vision-Language Models [38.35578010611503]
We propose a novel method to train RL agents from a single language instruction in visual environments.
GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language.
Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses it to rank potential goal states with an ELO-based rating system.
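The ELO-based rating mentioned above can be illustrated with the standard ELO update, treating each pairwise VLM preference as a match outcome; this is generic ELO, not GoalLadder's code, and how preferences are elicited from the VLM is assumed.

```python
def elo_update(rating_a: float, rating_b: float, a_preferred: bool,
               k: float = 32.0) -> tuple:
    """Update two candidate goal states' ratings from one VLM preference."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_preferred else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```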
arXiv Detail & Related papers (2025-06-19T15:28:27Z)
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges.
While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.
We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z)
- Text-Aware Diffusion for Policy Learning [8.32790576855495]
We propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning.
We show that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments.
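One plausible reading of computing dense zero-shot rewards with a frozen text-conditioned diffusion model is to noise a rendered frame and reward the policy when the text-conditioned denoiser explains that noise well. The sketch below assumes a hypothetical `denoise(noisy, t, text)` noise predictor and a toy noise schedule; it illustrates the idea rather than TADPoLe's exact objective.

```python
import numpy as np


def diffusion_reward(frame: np.ndarray, text: str, denoise, t: int,
                     rng: np.random.Generator) -> float:
    """Score a frame by how well a frozen text-conditioned denoiser predicts
    the noise added to it; higher when the frame matches the description."""
    noise = rng.standard_normal(frame.shape)
    alpha = 0.9 ** t  # toy noise schedule (assumption)
    noisy = np.sqrt(alpha) * frame + np.sqrt(1.0 - alpha) * noise
    predicted = denoise(noisy, t, text)  # frozen, pretrained model
    return -float(np.mean((predicted - noise) ** 2))
```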
arXiv Detail & Related papers (2024-07-02T03:08:20Z)
- Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings [107.1837163643886]
We present a functional reward encoding (FRE) as a general, scalable solution to this zero-shot RL problem.
Our main idea is to learn functional representations of any arbitrary tasks by encoding their state-reward samples.
We empirically show that FRE agents trained on diverse random unsupervised reward functions can generalize to solve novel tasks.
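The core idea, representing a task by encoding its (state, reward) samples, can be sketched as a permutation-invariant set encoder; `embed` stands in for a learned network (an assumption), and FRE's end-to-end training is not shown.

```python
import numpy as np


def encode_task(states: np.ndarray, rewards: np.ndarray, embed) -> np.ndarray:
    """Encode a task from N (state, reward) samples into one latent vector.

    states: (N, d) array; rewards: (N,) array; embed: learned per-sample
    network (hypothetical). Mean pooling makes the encoding order-invariant.
    """
    samples = np.concatenate([states, rewards[:, None]], axis=1)  # (N, d + 1)
    return embed(samples).mean(axis=0)  # latent task vector z
```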
arXiv Detail & Related papers (2024-02-27T01:59:02Z)
- Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image itself, but to the desired change between the start and goal images.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions [92.96783800362886]
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.
We introduce GRILL, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances.
arXiv Detail & Related papers (2023-05-24T03:33:21Z)
- Reward Design with Language Models [27.24197025688919]
Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require expert demonstrations.
Can we instead cheaply design rewards using a natural language interface?
This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function.
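In its simplest form, using an LLM as a proxy reward function means prompting it with the task description and a textual summary of the episode, then mapping its answer to a scalar. The sketch below is a hedged illustration with a hypothetical `llm` callable; the prompt format and yes/no parsing are assumptions, not the paper's exact protocol.

```python
def llm_proxy_reward(task_description: str, episode_summary: str, llm) -> float:
    """Return 1.0 if the LLM judges the behavior successful, else 0.0."""
    prompt = (
        f"Task: {task_description}\n"
        f"Agent behavior: {episode_summary}\n"
        "Did the agent accomplish the task? Answer yes or no."
    )
    answer = llm(prompt).strip().lower()  # llm: hypothetical completion call
    return 1.0 if answer.startswith("yes") else 0.0
```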
arXiv Detail & Related papers (2023-02-27T22:09:35Z)
- Guiding Pretraining in Reinforcement Learning with Large Language Models [133.32146904055233]
We describe a method that uses background knowledge from text corpora to shape exploration.
This method, called ELLM, rewards an agent for achieving goals suggested by a language model.
By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop.
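A natural way to implement rewarding the agent for achieving LM-suggested goals is to compare a caption of the agent's transition against each suggested goal in a text-embedding space; here `embed`, the caption, and the similarity threshold are all assumptions, not ELLM's exact recipe.

```python
import numpy as np


def ellm_reward(transition_caption: str, suggested_goals, embed,
                threshold: float = 0.5) -> float:
    """Reward = best cosine similarity between caption and any suggested goal."""
    c = embed(transition_caption)
    best = 0.0
    for goal in suggested_goals:
        g = embed(goal)
        sim = float(np.dot(c, g) / (np.linalg.norm(c) * np.linalg.norm(g) + 1e-8))
        best = max(best, sim)
    return best if best > threshold else 0.0
```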
arXiv Detail & Related papers (2023-02-13T21:16:03Z)
- Distilling Internet-Scale Vision-Language Models into Embodied Agents [24.71298634838615]
We propose using pretrained vision-language models (VLMs) to supervise embodied agents.
We combine ideas from model distillation and hindsight experience replay (HER) to retroactively generate language describing the agent's behavior.
Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
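The combination of distillation and hindsight experience replay described above can be pictured as relabeling: after an episode, a VLM describes what the agent actually did, and that description becomes the language label for supervised training. In this sketch `vlm_describe` is hypothetical, and a real pipeline would be richer than captioning alone.

```python
def hindsight_relabel(trajectory, vlm_describe):
    """Retroactively label a trajectory with the behavior it achieved.

    trajectory: list of (observation, action) pairs.
    vlm_describe: hypothetical VLM call mapping observations to a caption.
    """
    achieved = vlm_describe([obs for obs, _ in trajectory])
    return [(obs, act, achieved) for obs, act in trajectory]
```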
arXiv Detail & Related papers (2023-01-29T18:21:05Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text [12.88819706338837]
Recent work has described neural-network-based agents that are trained with reinforcement learning to execute language-like commands in simulated worlds.
We propose a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions.
arXiv Detail & Related papers (2020-05-19T12:16:58Z)
- On the interaction between supervision and self-play in emergent communication [82.290338507106]
We investigate the relationship between two categories of learning signals with the ultimate goal of improving sample efficiency.
We find that first training agents via supervised learning on human data followed by self-play outperforms the converse.
arXiv Detail & Related papers (2020-02-04T02:35:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.