Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
- URL: http://arxiv.org/abs/2410.17422v2
- Date: Wed, 04 Dec 2024 22:03:08 GMT
- Title: Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
- Authors: Wen Jiang, Boshu Lei, Katrina Ashton, Kostas Daniilidis,
- Abstract summary: We present an active mapping system that could plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting representation.<n> Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
- Score: 26.602364433232445
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present an active mapping system that could plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting (3DGS) representation. Existing methods either did not take advantage of recent developments in multimodal Large Language Models (LLM) or did not consider challenges in localization uncertainty, which is critical in embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based algorithm. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
Related papers
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and inverse dynamics model.
By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data.
On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
arXiv Detail & Related papers (2025-04-23T17:53:34Z) - Empowering Large Language Models with 3D Situation Awareness [84.12071023036636]
A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions.
We propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection.
We introduce a situation grounding module to explicitly predict the position and orientation of observer's viewpoint, thereby enabling LLMs to ground situation description in 3D scenes.
arXiv Detail & Related papers (2025-03-29T09:34:16Z) - NextBestPath: Efficient 3D Mapping of Unseen Environments [33.62355071343121]
Previous approaches mainly predict the next best view near the agent's location, which is prone to getting stuck in local areas.
We introduce a novel dataset AiMDoom with a map generator for the Doom video game, enabling to better benchmark active 3D mapping in diverse indoor environments.
We propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views.
arXiv Detail & Related papers (2025-02-07T23:18:08Z) - 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow [69.94527569577295]
3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world.
Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum.
We propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing.
arXiv Detail & Related papers (2025-01-28T04:31:19Z) - Mode-GS: Monocular Depth Guided Anchored 3D Gaussian Splatting for Robust Ground-View Scene Rendering [47.879695094904015]
We present a novelview rendering algorithm, Mode-GS, for ground-robot trajectory datasets.
Our approach is based on using anchored Gaussian splats, which are designed to overcome the limitations of existing 3D Gaussian splatting algorithms.
Our method results in improved rendering performance, based on PSNR, SSIM, and LPIPS metrics, in ground scenes with free trajectory patterns.
arXiv Detail & Related papers (2024-10-06T23:01:57Z) - Towards Real-Time Gaussian Splatting: Accelerating 3DGS through Photometric SLAM [4.08109886949724]
We propose integrating 3DGS with Direct Sparse Odometry, a monocular photometric SLAM system.
Preliminary experiments show that using Direct Sparse Odometry point cloud outputs, as opposed to standard structure-from-motion methods, significantly shortens the training time needed to achieve high-quality renders.
arXiv Detail & Related papers (2024-08-07T15:01:08Z) - IG-SLAM: Instant Gaussian SLAM [6.228980850646457]
3D Gaussian Splatting has recently shown promising results as an alternative scene representation in SLAM systems.
We present IG-SLAM, a dense RGB-only SLAM system that employs robust Dense-SLAM methods for tracking and combines them with Gaussian Splatting.
We demonstrate competitive performance with state-of-the-art RGB-only SLAM systems while achieving faster operation speeds.
arXiv Detail & Related papers (2024-08-02T09:07:31Z) - Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models [6.860460230412773]
We propose an LLM-embodied path planning framework for mobile agents.
Our proposed multi-layer architecture uses prompted LLMs in the path planning phase and integrates them with the mobile agents' low-level actuators.
Our experiments show that this framework can improve LLMs' 2D plane reasoning abilities and complete coverage path planning tasks.
arXiv Detail & Related papers (2024-07-02T12:38:46Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z) - World Models with Hints of Large Language Models for Goal Achieving [56.91610333715712]
Reinforcement learning struggles in the face of long-horizon tasks and sparse goals.
Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (M).DLL.M integrates the proposed hinting subgoals into the model rollouts to encourage goal discovery and reaching in challenging tasks.
arXiv Detail & Related papers (2024-06-11T15:49:08Z) - MotionGS : Compact Gaussian Splatting SLAM by Motion Filter [10.979138131565238]
There has been a surge in NeRF-based SLAM, while 3DGS-based SLAM is sparse.
A novel 3DGS-based SLAM approach with a fusion of deep visual feature, dual selection and 3DGS is presented in this paper.
arXiv Detail & Related papers (2024-05-18T00:47:29Z) - MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements [59.70107451308687]
We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM.
Our method, MM3DGS, addresses the limitations of prior rendering by enabling faster scale awareness, and improved trajectory tracking.
We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit.
arXiv Detail & Related papers (2024-04-01T04:57:41Z) - CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field [46.8198987091734]
This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field.
Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz.
arXiv Detail & Related papers (2024-03-24T11:19:59Z) - GaussianPro: 3D Gaussian Splatting with Progressive Propagation [49.918797726059545]
3DGS relies heavily on the point cloud produced by Structure-from-Motion (SfM) techniques.
We propose a novel method that applies a progressive propagation strategy to guide the densification of the 3D Gaussians.
Our method significantly surpasses 3DGS on the dataset, exhibiting an improvement of 1.15dB in terms of PSNR.
arXiv Detail & Related papers (2024-02-22T16:00:20Z) - MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction [2.3630527334737104]
MoD-SLAM is the first monocular NeRF-based dense mapping method that allows 3D reconstruction in real-time in unbounded scenes.
By introducing a robust depth loss term into the tracking process, our SLAM system achieves more precise pose estimation in large-scale scenes.
Our experiments on two standard datasets show that MoD-SLAM achieves competitive performance, improving the accuracy of the 3D reconstruction and localization by up to 30% and 15% respectively.
arXiv Detail & Related papers (2024-02-06T07:07:33Z) - FIT-SLAM -- Fisher Information and Traversability estimation-based
Active SLAM for exploration in 3D environments [1.4474137122906163]
Active visual SLAM finds a wide array of applications in-Denied sub-terrain environments and outdoor environments for ground robots.
It is imperative to incorporate the perception considerations in the goal selection and path planning towards the goal during an exploration mission.
We propose FIT-SLAM, a new exploration method tailored for unmanned ground vehicles (UGVs) to explore 3D environments.
arXiv Detail & Related papers (2024-01-17T16:46:38Z) - A Survey on 3D Gaussian Splatting [51.96747208581275]
3D Gaussian splatting (GS) has emerged as a transformative technique in the realm of explicit radiance field and computer graphics.
We provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS.
By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond.
arXiv Detail & Related papers (2024-01-08T13:42:59Z) - GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting [51.96353586773191]
We introduce textbfGS-SLAM that first utilizes 3D Gaussian representation in the Simultaneous Localization and Mapping system.
Our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D rendering.
Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica, TUM-RGBD datasets.
arXiv Detail & Related papers (2023-11-20T12:08:23Z) - SayPlan: Grounding Large Language Models using 3D Scene Graphs for
Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.