RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
- URL: http://arxiv.org/abs/2412.08591v2
- Date: Wed, 19 Mar 2025 10:05:05 GMT
- Title: RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
- Authors: Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev
- Abstract summary: We introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos. RoomTour3D generates open-ended human walking trajectories and open-world navigable instructions. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple Vision-and-Language Navigation tasks.
- Score: 87.8836203762073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.
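To make the data layout described above concrete, the following is a minimal sketch of how one RoomTour3D trajectory record might be represented and loaded. The field names (`video_id`, `path_points`, `room_types`, `objects`, `instructions`) and the JSON-lines layout are illustrative assumptions, not the dataset's published schema.

```python
# Illustrative sketch only: field names and the JSON-lines layout are
# assumptions for exposition, not the dataset's published schema.
from dataclasses import dataclass
from typing import List, Tuple
import json

@dataclass
class SceneObject:
    label: str                                   # open-world object name
    position: Tuple[float, float, float]         # 3D location in the reconstructed scene

@dataclass
class TrajectorySample:
    video_id: str                                # source room-tour video
    path_points: List[Tuple[float, float, float]]  # 3D walking trajectory from reconstruction
    room_types: List[str]                        # room labels along the path
    objects: List[SceneObject]                   # surrounding objects with 3D locations
    instructions: List[str]                      # open-world navigable instructions (~2 per trajectory)

def load_samples(path: str) -> List[TrajectorySample]:
    """Load trajectory records from a hypothetical JSON-lines file."""
    samples = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            samples.append(TrajectorySample(
                video_id=rec["video_id"],
                path_points=[tuple(p) for p in rec["path_points"]],
                room_types=rec["room_types"],
                objects=[SceneObject(o["label"], tuple(o["position"])) for o in rec["objects"]],
                instructions=rec["instructions"],
            ))
    return samples
```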
Related papers
- SpatialTrackerV2: 3D Point Tracking Made Easy [73.0350898700048]
SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion. By learning geometry and motion jointly from heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%.
arXiv Detail & Related papers (2025-07-16T17:59:03Z)
- Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation [61.21302433849139]
Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments. We propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train a 3D-VLM in navigation action prediction. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation.
arXiv Detail & Related papers (2025-05-16T15:46:27Z)
- OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation [49.697035403548966]
Vision-Language Navigation (VLN) aims to guide agents through an environment by leveraging both language instructions and visual cues.
We propose OpenFly, a platform comprising a versatile toolchain and large-scale benchmark for aerial VLN.
We construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes.
The corresponding visual data are generated using various rendering engines and advanced techniques, including Unreal, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS).
arXiv Detail & Related papers (2025-02-25T09:57:18Z)
- TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D.
This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation [59.3649071376364]
The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments found in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
arXiv Detail & Related papers (2024-02-05T22:20:19Z)
- 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification [19.125633699422117]
We propose a framework for 3D-aware ObjectNav based on two straightforward sub-policies.
Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets.
arXiv Detail & Related papers (2022-12-01T07:55:56Z)
- 3D Visual Tracking Framework with Deep Learning for Asteroid Exploration [22.808962211830675]
In this paper, we focus on an accurate and real-time method for 3D tracking.
A new large-scale 3D asteroid tracking dataset is presented, including binocular video sequences, depth maps, and point clouds of diverse asteroids.
We propose a deep-learning-based 3D tracking framework, named Track3D, which combines a 2D monocular tracker with a novel lightweight amodal axis-aligned bounding-box network, A3BoxNet.
arXiv Detail & Related papers (2021-11-21T04:14:45Z)
- PC-DAN: Point Cloud based Deep Affinity Network for 3D Multi-Object Tracking (Accepted as an extended abstract in JRDB-ACT Workshop at CVPR21) [68.12101204123422]
A point cloud is a dense compilation of spatial data in 3D coordinates.
We propose a PointNet-based approach for 3D Multi-Object Tracking (MOT).
arXiv Detail & Related papers (2021-06-03T05:36:39Z)
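To make the point-cloud affinity idea above tangible, here is a rough sketch of a PointNet-style affinity step for 3D MOT. It is not PC-DAN's published architecture: the random weights, feature dimensions, and greedy matching are stand-in assumptions used only to illustrate the general technique (shared per-point MLP, symmetric max pooling, pairwise affinity between tracks and detections).

```python
# Rough sketch of a PointNet-style affinity step for 3D MOT.
# Not PC-DAN's actual architecture: weights are random stand-ins and the
# per-point MLP, pooling, and matching are simplified for illustration.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 128))  # shared per-point MLP weights

def encode(points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point cloud to a single 128-d embedding (MLP + max pool)."""
    h = np.maximum(points @ W1, 0)        # per-point features, ReLU
    h = np.maximum(h @ W2, 0)
    g = h.max(axis=0)                     # symmetric max pooling over points
    return g / (np.linalg.norm(g) + 1e-8)

def affinity_matrix(prev_clouds, curr_clouds) -> np.ndarray:
    """Cosine-similarity affinities between previous tracks and current detections."""
    prev = np.stack([encode(p) for p in prev_clouds])
    curr = np.stack([encode(c) for c in curr_clouds])
    return prev @ curr.T

# Toy usage: two tracks from frame t-1, three detections at frame t.
prev = [rng.normal(size=(100, 3)), rng.normal(size=(80, 3)) + 5.0]
curr = [rng.normal(size=(90, 3)), rng.normal(size=(120, 3)) + 5.0, rng.normal(size=(60, 3)) - 5.0]
A = affinity_matrix(prev, curr)
matches = A.argmax(axis=1)                # greedy assignment per track
print(A.round(2), matches)
```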