Related papers: Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

URL: http://arxiv.org/abs/2504.00907v2
Date: Wed, 02 Apr 2025 01:49:21 GMT
Title: Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning
Authors: Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi,
Abstract summary: Embodied agents operating in real-world environments must interpret ambiguous and under-specified human instructions.<n>We introduce the Ask-to-Act task, where an embodied agent must fetch a specific object instance given an ambiguous instruction in a home environment.<n>We propose a novel approach that fine-tunes multimodal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards.
Score: 48.098838027631494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embodied agents operating in real-world environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent must fetch a specific object instance given an ambiguous instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To solve this problem, we propose a novel approach that fine-tunes multimodal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero-shot baselines, including GPT-4o, and supervised fine-tuned MLLMs, on our task. Our results demonstrate that our RL-finetuned MLLM outperforms all baselines by a significant margin ($19.1$-$40.3\%$), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.

Related papers

Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs)<n>Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data.<n>We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z)
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL [62.984693936073974]
Large language models (LLMs) excel in tasks like question answering and dialogue.<n>Complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning.<n>We propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents.
arXiv Detail & Related papers (2025-05-23T16:51:54Z)
VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making [45.02997774119763]
Vision-language models (VLMs) extend large language models (LLMs) to multi-modal data.<n>Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective.
arXiv Detail & Related papers (2025-05-06T04:51:57Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
Reinforcement Learning Problem Solving with Large Language Models [0.0]
Large Language Models (LLMs) have an extensive amount of world knowledge, and this has enabled their application in various domains to improve the performance of Natural Language Processing (NLP) tasks. This has also facilitated a more accessible paradigm of conversation-based interactions between humans and AI systems to solve intended problems. We show the practicality of our approach through two detailed case studies for "Research Scientist" and "Legal Matter Intake"
arXiv Detail & Related papers (2024-04-29T12:16:08Z)
EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents [65.38474102119181]
We propose EnvGen, a framework to adaptively create training environments. We train a small RL agent in a mixture of the original and LLM-generated environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster.
arXiv Detail & Related papers (2024-03-18T17:51:16Z)
Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline [2.6644624823848426]
Large Language Models (LLMs) are the cornerstone in automating Requirements Engineering (RE) tasks. This chapter aims to furnish readers with essential knowledge about LLMs in its initial segment. It provides a comprehensive guideline tailored for students, researchers, and practitioners on harnessing LLMs to address their specific objectives.
arXiv Detail & Related papers (2024-02-21T14:00:52Z)
Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents [16.24662355253529]
Large Language Models (LLMs) can address sequential decision-making tasks through the provision of high-level instructions. LLMs lack specialization in tackling specific target problems, particularly in real-time dynamic environments. We introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent.
arXiv Detail & Related papers (2023-11-22T13:15:42Z)
LLM Augmented Hierarchical Agents [4.574041097539858]
Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning) In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks.
arXiv Detail & Related papers (2023-11-09T18:54:28Z)
LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents [10.936460061405157]
We propose LgTS (LLM-guided Teacher-Student learning), a novel approach that explores the planning abilities of LLMs. Our approach does not assume access to a propreitary or a fine-tuned LLM, nor does it require pre-trained policies that achieve the sub-goals proposed by the LLM.
arXiv Detail & Related papers (2023-10-14T00:07:03Z)
Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach [31.6589518077397]
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets. LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions. We propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions.
arXiv Detail & Related papers (2023-06-06T11:49:09Z)
Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling [101.59430768507997]
Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. We propose using few-shot large language models (LLMs) to hypothesize an Abstract World Model (AWM) Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude.
arXiv Detail & Related papers (2023-01-28T02:04:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.