WoMAP: World Models For Embodied Open-Vocabulary Object Localization
- URL: http://arxiv.org/abs/2506.01600v1
- Date: Mon, 02 Jun 2025 12:35:14 GMT
- Title: WoMAP: World Models For Embodied Open-Vocabulary Object Localization
- Authors: Tenny Yin, Zhiting Mei, Tao Sun, Lihan Zha, Emily Zhou, Jeremy Bao, Miyu Yamane, Ola Shorinwa, Anirudha Majumdar
- Abstract summary: WoMAP (World Models for Active Perception) is a recipe for training open-vocabulary object localization policies. We show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.
- Score: 8.947213246332764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a broad range of zero-shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.
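To make component (iii) concrete, below is a minimal sketch of inference-time action grounding with a latent world model: a high-level planner (e.g., a VLM) proposes candidate actions, and a learned latent dynamics model plus reward head scores each proposal so the robot executes the most promising one. All module names, dimensions, and architectures here (Encoder, LatentDynamics, RewardHead, ground_proposals) are hypothetical placeholders for illustration, not WoMAP's actual design.

```python
# Sketch of inference-time action grounding with a latent world model.
# All names and dimensions are illustrative placeholders, not WoMAP's architecture.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw observation to a compact latent state."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state given the current latent and an action."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

class RewardHead(nn.Module):
    """Predicts a detector-distilled localization reward from a latent state."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

@torch.no_grad()
def ground_proposals(obs, proposals, encoder, dynamics, reward_head):
    """Score each candidate action by its predicted reward; return the best one."""
    z = encoder(obs)                                         # (1, latent_dim)
    z_next = dynamics(z.expand(len(proposals), -1), proposals)
    scores = reward_head(z_next)                             # one score per proposal
    return proposals[scores.argmax()], scores

# Toy usage: five candidate actions for a single 64-dim observation.
obs = torch.randn(1, 64)
proposals = torch.randn(5, 4)
best_action, scores = ground_proposals(obs, proposals,
                                       Encoder(), LatentDynamics(), RewardHead())
```

The design choice this illustrates is that the world model, trained on rewards distilled from an open-vocabulary detector, supplies the physical grounding that a VLM planner alone lacks.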
Related papers
- Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition [63.55828203989405]
We introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures. We propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training.
arXiv Detail & Related papers (2025-06-26T11:53:59Z)
- AnyPlace: Learning Generalized Object Placement for Robot Manipulation [37.725807003481904]
We propose AnyPlace, a two-stage method trained entirely on synthetic data. Our key insight is that by leveraging a Vision-Language Model, we focus only on the relevant regions for local placement. For training, we generate a fully synthetic dataset of randomly generated objects in different placement configurations. In real-world experiments, we show how our approach directly transfers models trained purely on synthetic data to the real world.
arXiv Detail & Related papers (2025-02-06T22:04:13Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models [16.50787220881633]
Large language models (LLMs) have shown significant potential in guiding embodied agents to execute language instructions. Existing methods are primarily designed for static environments and do not leverage the agent's own experiences to refine its initial plans. This study introduces the Experience-and-Emotion Map (E2Map), which integrates not only LLM knowledge but also the agent's real-world experiences.
arXiv Detail & Related papers (2024-09-16T06:35:18Z)
- Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community [58.417475846791234]
We propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task. We conduct experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
arXiv Detail & Related papers (2024-08-17T06:24:43Z)
- YOLO-World: Real-Time Open-Vocabulary Object Detection [87.08732047660058]
We introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities.
Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency.
YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed.
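Since WoMAP distills its reward signal from open-vocabulary detectors of this kind, a brief usage sketch may be useful. The snippet below assumes the Ultralytics YOLO-World integration; the checkpoint name, vocabulary prompts, and image path are illustrative placeholders.

```python
# Prompt-then-detect sketch, assuming the Ultralytics YOLO-World integration.
# Checkpoint name, class prompts, and image path are placeholders.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")            # pre-trained open-vocabulary weights
model.set_classes(["coffee mug", "backpack"])    # free-form vocabulary for this query
results = model.predict("scene.jpg")             # zero-shot detection on one image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)           # class index, confidence, box corners
```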
arXiv Detail & Related papers (2024-01-30T18:59:38Z)
- Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- SEAL: Simultaneous Exploration and Localization in Multi-Robot Systems [0.0]
This paper proposes a novel simultaneous exploration and localization approach.
It uses information fusion for maximum exploration while performing communication graph optimization for relative localization.
SEAL outperformed cutting-edge methods on exploration and localization performance in extensive ROS-Gazebo simulations.
arXiv Detail & Related papers (2023-06-22T01:27:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.