DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation
- URL: http://arxiv.org/abs/2308.07498v1
- Date: Mon, 14 Aug 2023 23:45:01 GMT
- Title: DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation
- Authors: Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang
- Abstract summary: We propose DREAMWALKER -- a world model based VLN-CE agent.
The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment.
It can simulate and evaluate possible plans entirely in such an internal abstract world before executing costly actions.
- Score: 107.5934592892763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: VLN-CE is a recently released embodied task, where AI agents need to navigate
a freely traversable environment to reach a distant target location, given
language instructions. It poses great challenges due to the huge space of
possible strategies. Driven by the belief that the ability to anticipate the
consequences of future actions is crucial for the emergence of intelligent and
interpretable planning behavior, we propose DREAMWALKER -- a world model based
VLN-CE agent. The world model is built to summarize the visual, topological,
and dynamic properties of the complicated continuous environment into a
discrete, structured, and compact representation. DREAMWALKER can simulate and
evaluate possible plans entirely in such an internal abstract world before
executing costly actions. Unlike existing model-free VLN-CE agents, which
simply make greedy decisions in the real world and thus easily fall into
shortsighted behaviors, DREAMWALKER can plan strategically through large
numbers of "mental experiments." Moreover, the imagined future
scenarios reflect our agent's intention, making its decision-making process
more transparent. Extensive experiments and ablation studies on the VLN-CE
dataset confirm the effectiveness of the proposed approach and outline fruitful
directions for future work.
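The simulate-before-act idea in the abstract can be sketched in a few lines of Python. This is a minimal toy, not the paper's implementation: the `ToyWorldModel` interface, the waypoint graph, and the plan-scoring rule are all illustrative assumptions.

```python
class ToyWorldModel:
    """Toy stand-in for a learned world model: a discrete, structured
    abstraction of the environment (here, a small graph of waypoints)."""

    def __init__(self, edges, goal):
        self.edges = edges  # waypoint -> list of reachable waypoints
        self.goal = goal

    def rollout(self, start, plan):
        """Mentally simulate a plan (a sequence of waypoint choices)
        without executing anything in the real environment."""
        pos = start
        for step in plan:
            if step not in self.edges.get(pos, []):
                return pos, False  # plan is infeasible from here
            pos = step
        return pos, pos == self.goal


def mental_planning(model, start, candidate_plans):
    """Evaluate every candidate plan inside the model before acting;
    return a feasible goal-reaching plan, preferring shorter ones."""
    best = None
    for plan in candidate_plans:
        _, success = model.rollout(start, plan)
        if success and (best is None or len(plan) < len(best)):
            best = plan
    return best


# A 4-waypoint toy environment: A -> B -> D (goal), A -> C -> D.
model = ToyWorldModel({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}, goal="D")
plans = [["B"], ["C"], ["B", "D"], ["C", "D"]]
print(mental_planning(model, "A", plans))  # -> ['B', 'D']
```

Only the chosen plan would then be executed with real, costly actions; every rejected plan was discarded purely inside the model.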
Related papers
- Learning World Models for Unconstrained Goal Navigation [4.549550797148707]
We introduce a goal-directed exploration algorithm, MUN, for learning world models.
MUN is capable of modeling state transitions between arbitrary subgoal states in the replay buffer.
Results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy's capacity to generalize.
arXiv Detail & Related papers (2024-11-03T01:35:06Z)
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation [25.26545170310844]
We present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making.
Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training.
arXiv Detail & Related papers (2024-10-17T05:37:00Z)
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.55649666025926]
We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities.
Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans.
We propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
arXiv Detail & Related papers (2024-09-22T00:30:11Z)
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind.
This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z)
- Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z)
- Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation [19.840186443344]
We propose to use structured world models to incorporate inductive biases in the control loop to achieve sample-efficient exploration.
Our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time.
arXiv Detail & Related papers (2022-06-22T22:08:50Z)
- Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning [114.1830997893756]
This work focuses on learning a model to plan goal-directed actions in real-life videos.
We propose novel algorithms to model human behaviors through Bayesian Inference and model-based Imitation Learning.
arXiv Detail & Related papers (2021-10-05T01:06:53Z)
- World Model as a Graph: Learning Latent Landmarks for Planning [12.239590266108115]
Planning is a hallmark of human intelligence.
One prominent framework, Model-Based RL, learns a world model and plans using step-by-step virtual rollouts.
We propose to learn graph-structured world models composed of sparse, multi-step transitions.
arXiv Detail & Related papers (2020-11-25T02:49:21Z)
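The contrast the last entry draws, between step-by-step virtual rollouts and sparse multi-step transitions, can be illustrated with a minimal sketch: once the world model is a graph of landmarks, planning reduces to a shortest-path search. The landmark names and edge costs below are invented for illustration and are not from the cited paper.

```python
import heapq


def plan_over_landmarks(graph, start, goal):
    """Dijkstra over a graph-structured world model whose edges are
    sparse, multi-step transitions between learned landmarks."""
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(queue, (nd, nxt))
    # Reconstruct the landmark sequence (the "plan").
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return list(reversed(path)), dist[goal]


# Invented landmark graph: each edge abstracts many low-level steps.
graph = {
    "hall":    [("kitchen", 3.0), ("stairs", 1.0)],
    "stairs":  [("bedroom", 2.0)],
    "kitchen": [("bedroom", 4.0)],
}
path, cost = plan_over_landmarks(graph, "hall", "bedroom")
print(path, cost)  # -> ['hall', 'stairs', 'bedroom'] 3.0
```

Because each edge stands for a multi-step transition, the search touches only a handful of landmark nodes instead of rolling the model forward one primitive action at a time.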
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.