Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
- URL: http://arxiv.org/abs/2410.17422v2
- Date: Wed, 04 Dec 2024 22:03:08 GMT
- Title: Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
- Authors: Wen Jiang, Boshu Lei, Katrina Ashton, Kostas Daniilidis
- Abstract summary: We present an active mapping system that can plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting representation.
Experiments conducted on the Gibson and Habitat-Matterport 3D datasets show that the proposed method achieves state-of-the-art results.
- Abstract: We present an active mapping system that can plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLMs) or do not account for localization uncertainty, which is critical for embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based algorithm. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner that selects long-horizon exploration goals from a semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate that the proposed method achieves state-of-the-art results.
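To make the path-selection objective in the abstract concrete, the following is a minimal, hypothetical Python sketch of scoring candidate paths by an information-gain term (log-determinant of accumulated Fisher information, approximated as J^T J from rendering Jacobians) minus a localization-cost term. The helper names, data layout, and the lambda weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of uncertainty-aware path scoring with Fisher information.
# All names (pose_fisher_information, path_score, candidate_paths, ...) are
# hypothetical placeholders; the 3DGS renderer and planner internals are abstracted away.
import numpy as np

def pose_fisher_information(jacobian: np.ndarray, noise_var: float = 1.0) -> np.ndarray:
    """Approximate Fisher information of a camera pose from the Jacobian of the
    rendered pixels w.r.t. the 6-DoF pose, assuming i.i.d. Gaussian pixel noise:
    I = J^T J / sigma^2. `jacobian` has shape (num_pixels, 6)."""
    return jacobian.T @ jacobian / noise_var

def path_score(view_jacobians, reloc_jacobians, lam: float = 0.1) -> float:
    """Score a candidate path: reward expected information gain from new views,
    penalize localization uncertainty at views used to stay registered to the map."""
    # Information gain: log-determinant (D-optimality) of accumulated Fisher
    # information over the views the path would add to the map.
    info = sum(pose_fisher_information(J) for J in view_jacobians)
    gain = np.linalg.slogdet(info + 1e-6 * np.eye(info.shape[0]))[1]

    # Localization cost: trace of the pose covariance (inverse Fisher information)
    # at relocalization views against the existing 3DGS map.
    cost = 0.0
    for J in reloc_jacobians:
        I = pose_fisher_information(J)
        cost += np.trace(np.linalg.inv(I + 1e-6 * np.eye(I.shape[0])))
    return gain - lam * cost

# Usage (Jacobians would come from differentiating the 3DGS renderer w.r.t. pose):
# best = max(candidate_paths, key=lambda p: path_score(p.view_jacobians, p.reloc_jacobians))
```

The log-determinant reward corresponds to D-optimal experimental design over the accumulated information, while the trace-of-covariance penalty is one common proxy for expected localization error; the paper may weight or approximate these terms differently.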
Related papers
- NextBestPath: Efficient 3D Mapping of Unseen Environments [33.62355071343121]
Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas.
We introduce a novel dataset, AiMDoom, with a map generator for the Doom video game, enabling better benchmarking of active 3D mapping in diverse indoor environments.
We propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views.
arXiv Detail & Related papers (2025-02-07T23:18:08Z) - 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow [69.94527569577295]
3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world.
Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum.
We propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing.
arXiv Detail & Related papers (2025-01-28T04:31:19Z) - DELTA: Dense Efficient Long-range 3D Tracking for any video [82.26753323263009]
We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos.
Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions.
Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
arXiv Detail & Related papers (2024-10-31T17:59:01Z) - Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models [6.860460230412773]
We propose an LLM-embodied path planning framework for mobile agents.
Our proposed multi-layer architecture uses prompted LLMs in the path planning phase and integrates them with the mobile agents' low-level actuators.
Our experiments show that this framework can improve LLMs' 2D plane reasoning abilities and complete coverage path planning tasks.
arXiv Detail & Related papers (2024-07-02T12:38:46Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - World Models with Hints of Large Language Models for Goal Achieving [56.91610333715712]
Reinforcement learning struggles in the face of long-horizon tasks and sparse goals.
Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (DLLM). DLLM integrates the proposed hinting subgoals into the model rollouts to encourage goal discovery and goal reaching in challenging tasks.
arXiv Detail & Related papers (2024-06-11T15:49:08Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose an OccNeRF method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z) - SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.