Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models
- URL: http://arxiv.org/abs/2505.07815v2
- Date: Fri, 13 Jun 2025 22:39:23 GMT
- Title: Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models
- Authors: Seungjae Lee, Daniel Ekpo, Haowen Liu, Furong Huang, Abhinav Shrivastava, Jia-Bin Huang
- Abstract summary: We present IVE, an agentic exploration framework inspired by human curiosity. We evaluate IVE in both simulated and real-world tabletop environments.
- Score: 60.675955082094944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
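To make the described pipeline concrete, below is a minimal, self-contained sketch of an Imagine-Verify-Execute cycle and the visited-state entropy metric mentioned in the abstract. It is not the authors' implementation: `ToyVLM`, `ToyRobot`, and the triple-based scene-graph encoding are illustrative assumptions standing in for the paper's VLM prompting, action tools, and scene-graph abstraction.

```python
# Hypothetical sketch of the Imagine-Verify-Execute loop and the visited-state
# entropy metric, under the assumptions stated above (not the authors' code).
import math
import random
from collections import Counter

# A scene graph is modeled here as a frozenset of (subject, relation, object) triples.
SceneGraph = frozenset


class ToyVLM:
    """Stand-in for the VLM used to imagine novel scenes, verify them, and plan skills."""

    def propose(self, graph: SceneGraph, visited) -> SceneGraph:
        # Imagine: sample candidate configurations, preferring ones not yet in memory.
        objects = ["red_cube", "blue_cube", "green_cube"]
        surfaces = ["table", "red_cube", "blue_cube", "green_cube"]
        for _ in range(10):
            candidate = frozenset({(random.choice(objects), "on", random.choice(surfaces))})
            if candidate not in visited:
                return candidate
        return candidate

    def judge_feasibility(self, graph: SceneGraph) -> bool:
        # Verify: reject physically implausible relations (e.g., an object resting on itself).
        return all(subj != obj for subj, _, obj in graph)


class ToyRobot:
    """Stand-in for the action tools; executing a plan simply yields the target scene."""

    def __init__(self):
        self.state = frozenset({("red_cube", "on", "table")})

    def execute(self, target: SceneGraph) -> SceneGraph:
        self.state = target
        return self.state


def visited_state_entropy(counts: Counter) -> float:
    """Shannon entropy (nats) of the empirical distribution over visited states."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def explore(vlm: ToyVLM, robot: ToyRobot, steps: int = 100):
    memory, visits = [], Counter()
    graph = robot.state
    for _ in range(steps):
        memory.append(graph)                    # episodic memory of visited scene graphs
        visits[graph] += 1
        target = vlm.propose(graph, memory)     # Imagine a novel configuration
        if vlm.judge_feasibility(target):       # Verify physical plausibility
            graph = robot.execute(target)       # Execute via skill tools
    return memory, visited_state_entropy(visits)


if __name__ == "__main__":
    memory, H = explore(ToyVLM(), ToyRobot())
    print(f"visited-state entropy: {H:.3f} nats over {len(set(memory))} distinct states")
```

In this toy version, the memory both steers imagination toward unvisited configurations and provides the visitation counts from which the entropy (the paper's diversity measure) is computed.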
Related papers
- Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning [69.71072181304066]
We introduce Perceptive Dexterous Control (PDC), a framework for vision-driven whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues. We show that training from scratch with reinforcement learning can produce emergent behaviors such as active search.
arXiv Detail & Related papers (2025-05-18T07:33:31Z) - ForesightNav: Learning Scene Imagination for Efficient Exploration [57.49417653636244]
We propose ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. Our approach equips robotic agents with the capability to predict contextual information, such as occupancy and semantic details, for unexplored regions. We validate our imagination-based approach using the Structured3D dataset, demonstrating accurate occupancy prediction and superior performance in anticipating unseen scene geometry.
arXiv Detail & Related papers (2025-04-22T17:38:38Z) - SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models [22.96777963013918]
Intrinsic motivation attempts to decouple exploration from external, task-based rewards. SENSEI is a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior.
arXiv Detail & Related papers (2025-03-03T14:26:15Z) - Generative agents in the streets: Exploring the use of Large Language Models (LLMs) in collecting urban perceptions [0.0]
This study explores the current advancements in Generative agents powered by large language models (LLMs).
The experiment employs Generative agents to interact with the urban environments using street view images to plan their journey toward specific goals.
Since LLMs lack embodiment, access to the visual realm, and a sense of motion or direction, we designed movement and visual modules that help agents gain an overall understanding of their surroundings.
arXiv Detail & Related papers (2023-12-20T15:45:54Z) - CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z) - Embodied Agents for Efficient Exploration and Smart Scene Description [47.82947878753809]
We tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment.
We propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning.
Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions.
arXiv Detail & Related papers (2023-01-17T19:28:01Z) - Semantic Exploration from Language Abstractions and Pretrained Representations [23.02024937564099]
Effective exploration is a challenge in reinforcement learning (RL).
We define novelty using semantically meaningful state abstractions.
We evaluate vision-language representations, pretrained on natural image captioning datasets.
arXiv Detail & Related papers (2022-04-08T17:08:00Z) - Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state space, guided by a modest number of human-provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
arXiv Detail & Related papers (2020-10-22T17:49:25Z) - Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.