OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
- URL: http://arxiv.org/abs/2512.08625v1
- Date: Tue, 09 Dec 2025 14:10:23 GMT
- Title: OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
- Authors: Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
- Abstract summary: We present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives.
- Score: 18.261860689671582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
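The abstract describes attaching open-vocabulary (CLIP-style) semantic features to Gaussians so that a scene can be queried with free-form text. The following is only a minimal sketch of that general idea, not the authors' pipeline: it assumes each Gaussian already carries a feature vector in the same embedding space as the text prompts, and it labels each Gaussian by cosine similarity. The function names (`cosine`, `label_gaussians`) and the 3-dimensional toy vectors are hypothetical placeholders; real CLIP embeddings have hundreds of dimensions and would come from a text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def label_gaussians(gaussian_feats, text_embeds):
    """Assign each Gaussian the prompt whose embedding is most similar to its feature."""
    labels = []
    for feat in gaussian_feats:
        best = max(text_embeds, key=lambda name: cosine(feat, text_embeds[name]))
        labels.append(best)
    return labels


# Hypothetical toy vectors standing in for text embeddings and per-Gaussian features.
text_embeds = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
gaussian_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

print(label_gaussians(gaussian_feats, text_embeds))  # ['chair', 'table']
```

Because the label set is just a dictionary of prompt embeddings, new categories can be added at query time without retraining, which is the practical difference between open-set and closed-set segmentation.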
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
- SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation [114.57192386025373]
SegSplat is a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding. This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments.
arXiv Detail & Related papers (2025-11-23T10:26:38Z)
- LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM [2.0524609401792397]
We propose LEGO-SLAM, a framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. Experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
arXiv Detail & Related papers (2025-11-20T08:31:34Z)
- MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM [12.158063913401575]
We propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that integrates feed-forward pointmap regression with complementary sensor information. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness.
arXiv Detail & Related papers (2025-09-25T05:26:28Z)
- SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion [52.959716866316604]
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems. We propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC. SPHERE integrates voxel and Gaussian representations for joint exploitation of semantic and physical information.
arXiv Detail & Related papers (2025-09-14T09:07:41Z)
- MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System [19.16370123474815]
We propose MCOO-SLAM, a novel Multi-Camera Omnidirectional Object SLAM system. Our approach integrates point features and object-level landmarks enhanced with open-vocabulary semantics. Extensive real-world experiments demonstrate that MCOO-SLAM achieves accurate localization and scalable object-level mapping.
arXiv Detail & Related papers (2025-06-18T12:20:34Z)
- GSFF-SLAM: 3D Semantic Gaussian Splatting SLAM via Feature Field [17.57215792490409]
GSFF-SLAM is a novel dense semantic SLAM system based on 3D Gaussian Splatting. Our method supports semantic reconstruction using various forms of 2D priors, particularly sparse and noisy signals. When utilizing 2D ground truth priors, GSFF-SLAM achieves state-of-the-art semantic segmentation performance with 95.03% mIoU.
arXiv Detail & Related papers (2025-04-28T01:21:35Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z)
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields [50.68719394443926]
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- Using Detection, Tracking and Prediction in Visual SLAM to Achieve Real-time Semantic Mapping of Dynamic Scenarios [70.70421502784598]
RDS-SLAM can build object-level semantic maps of dynamic scenarios in real time using only a single commonly used Intel Core i7 CPU.
We evaluate RDS-SLAM on the TUM RGB-D dataset, and experimental results show that RDS-SLAM runs at 30.3 ms per frame in dynamic scenarios.
arXiv Detail & Related papers (2022-10-10T11:03:32Z)