EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2503.11345v1
- Date: Fri, 14 Mar 2025 12:21:26 GMT
- Title: EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting
- Authors: Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Guangming Shi, Licheng Jiao
- Abstract summary: EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
- Score: 108.15136508964011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egocentric scenes exhibit frequent occlusions, varied viewpoints, and dynamic interactions compared to typical scene understanding tasks. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with an 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.
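As a rough illustration of the multi-view instance feature aggregation described above, the sketch below pools per-view features inside each instance mask and fuses them across views. The integer mask IDs (standing in for SAM2 track IDs), the area-based visibility weighting, and all function names are assumptions for the demo, not EgoSplat's actual implementation.

```python
# Hypothetical sketch of multi-view instance feature aggregation.
# Per-view feature maps are pooled inside each instance mask (instance
# IDs standing in for SAM2 tracks), then fused across views weighted by
# mask area as a crude visibility proxy.
import numpy as np

def aggregate_instance_features(feat_maps, inst_masks):
    """feat_maps: list of (H, W, C) arrays, one per view.
    inst_masks: list of (H, W) int arrays; same instance ID across views.
    Returns {instance_id: (C,) fused feature}."""
    per_view = {}  # instance ID -> list of (weight, feature)
    for feats, masks in zip(feat_maps, inst_masks):
        for inst_id in np.unique(masks):
            if inst_id == 0:          # 0 = background / no instance
                continue
            sel = masks == inst_id
            weight = sel.sum()        # mask area as visibility proxy
            feature = feats[sel].mean(axis=0)
            per_view.setdefault(inst_id, []).append((weight, feature))
    fused = {}
    for inst_id, items in per_view.items():
        w = np.array([it[0] for it in items], dtype=np.float64)
        f = np.stack([it[1] for it in items])
        fused[inst_id] = (w[:, None] * f).sum(0) / w.sum()
    return fused
```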
Related papers
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
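A hedged sketch of the query-and-estimate step this summary describes: selecting Gaussians whose distilled features match an open-vocabulary text embedding, then fitting a rigid transform between two observed states of the selected part. The function names and the similarity threshold are invented for illustration; IAAO's actual pipeline is more involved.

```python
# Illustrative only: query per-Gaussian distilled features with a text
# embedding, then estimate a rigid transform (Kabsch) between two
# states of the matched part's Gaussian centers.
import numpy as np

def query_gaussians(gauss_feats, text_emb, thresh=0.25):
    """gauss_feats: (N, C) unit-norm per-Gaussian features.
    text_emb: (C,) unit-norm query embedding.
    Returns boolean mask of Gaussians matching the query."""
    return gauss_feats @ text_emb > thresh   # cosine similarity test

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform between matched centers src, dst: (M, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction keeps the result a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```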
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments [10.303368447554591]
This paper proposes a multi-task framework to simultaneously predict scene flow and instance segmentation of full-temporal point clouds.
The novelty of this work is threefold: 1) a coarse-to-fine, multi-task prediction scheme, where an initial coarse segmentation of static backgrounds and dynamic objects provides contextual information for refining motion and semantic information through a shared feature processing module; 2) a set of loss functions that enhance scene flow estimation and instance segmentation while helping to ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) a self-supervised learning scheme, which utilizes coarse...
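To make the loss combination in point 2) concrete, here is a minimal PyTorch sketch of a joint scene-flow / segmentation objective. The individual terms (Chamfer flow loss, cross-entropy against coarse pseudo-labels, per-segment flow smoothness) and the weights are placeholders, not SemanticFlow's published losses.

```python
# A hedged sketch of a joint scene-flow / segmentation objective.
import torch

def joint_loss(pred_flow, pts_t0, pts_t1, seg_logits, pseudo_labels,
               w_flow=1.0, w_seg=0.5, w_smooth=0.1):
    # Self-supervised flow term: warped points should land near frame t+1
    # (two-sided Chamfer distance via nearest neighbors).
    warped = pts_t0 + pred_flow                      # (N, 3)
    d = torch.cdist(warped, pts_t1)                  # (N, M)
    chamfer = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
    # Segmentation term against coarse static/dynamic pseudo-labels.
    seg = torch.nn.functional.cross_entropy(seg_logits, pseudo_labels)
    # Smoothness: penalize flow variance within each coarse segment.
    smooth = 0.0
    for c in pseudo_labels.unique():
        m = pseudo_labels == c
        if m.sum() > 1:
            smooth = smooth + pred_flow[m].var(dim=0).sum()
    return w_flow * chamfer + w_seg * seg + w_smooth * smooth
```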
arXiv Detail & Related papers (2025-03-19T02:43:19Z)
- InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception [17.530797215534456]
3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. We propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation.
arXiv Detail & Related papers (2024-11-28T16:08:36Z)
- S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving [12.406655155106424]
We propose S3PT, a novel scene semantics and structure guided clustering approach that provides more scene-consistent objectives for self-supervised training.
Our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals.
Second, we introduce object diversity consistent spatial clustering, to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs.
Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level.
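The third contribution (depth-guided spatial clustering) can be pictured as below: dense features are concatenated with scaled image coordinates and depth before clustering, so cluster boundaries respect scene geometry. The geometry weighting and the use of plain k-means are assumptions for illustration, not S3PT's exact formulation.

```python
# Toy depth-guided spatial clustering over a dense feature map.
import numpy as np
from sklearn.cluster import KMeans

def depth_guided_clusters(feats, depth, n_clusters=8, geo_weight=0.5):
    """feats: (H, W, C) dense features; depth: (H, W) metric depth.
    Returns an (H, W) map of cluster labels."""
    H, W, C = feats.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Normalized (u, v, depth) acts as the geometric regularizer.
    geo = np.stack([u / W, v / H, depth / (depth.max() + 1e-8)], -1)
    x = np.concatenate([feats, geo_weight * geo], -1).reshape(-1, C + 3)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(x)
    return labels.reshape(H, W)
```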
arXiv Detail & Related papers (2024-10-30T15:00:06Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation [9.431926560072412]
We propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field.
Experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes.
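A minimal sketch of what a voxel-based open-vocabulary field could look like: language features are averaged into a sparse voxel grid online, then queried with a text embedding. The grid resolution, running-mean fusion, and query threshold are assumptions, not O2V-mapping's actual design.

```python
# Sketch of a sparse voxel grid storing running-mean language features.
import numpy as np

class VoxelLanguageField:
    def __init__(self, voxel_size=0.05, dim=512):
        self.voxel_size, self.dim = voxel_size, dim
        self.sum, self.cnt = {}, {}   # sparse: voxel key -> stats

    def integrate(self, points, feats):
        """points: (N, 3) world coords; feats: (N, dim) language features."""
        keys = np.floor(points / self.voxel_size).astype(int)
        for k, f in zip(map(tuple, keys), feats):
            self.sum[k] = self.sum.get(k, np.zeros(self.dim)) + f
            self.cnt[k] = self.cnt.get(k, 0) + 1

    def query(self, text_emb, thresh=0.25):
        """Return voxel keys whose mean feature matches the text query."""
        hits = []
        for k in self.sum:
            f = self.sum[k] / self.cnt[k]
            f = f / (np.linalg.norm(f) + 1e-8)
            if f @ text_emb > thresh:
                hits.append(k)
        return hits
```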
arXiv Detail & Related papers (2024-04-10T08:54:43Z)
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
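The "nested" idea can be loosely sketched as a single feature head whose leading dimensions serve coarser scales, so one field supports hierarchical queries. The per-scale dimensions and the MLP below are invented for the demo and only gesture at N2F2's design.

```python
# Loose sketch of a nested feature head: prefixes of one output vector
# stand in for coarse-to-fine scales.
import torch

class NestedFeatureHead(torch.nn.Module):
    def __init__(self, in_dim=64, out_dim=128, scale_dims=(32, 64, 128)):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, out_dim))
        self.scale_dims = scale_dims  # nested prefixes: coarse -> fine

    def forward(self, x, scale=-1):
        f = self.mlp(x)
        d = self.scale_dims[scale]
        # Hierarchical supervision would match f[..., :d] against features
        # pooled over class-agnostic segmentation masks at that scale.
        return torch.nn.functional.normalize(f[..., :d], dim=-1)
```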
arXiv Detail & Related papers (2024-03-16T18:50:44Z)
- Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
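For background, here is one simplified slot-attention iteration, the mechanism the paper's semantic-aware masked variant builds on. The projections are passed in as callables, and the GRU slot update of the original algorithm is omitted; the masking itself is paper-specific and not reproduced here.

```python
# Minimal single slot-attention step (after Locatello et al.):
# attention is normalized across slots, so slots compete for inputs.
import torch

def slot_attention_step(slots, inputs, k_proj, q_proj, v_proj, eps=1e-8):
    """slots: (S, D), inputs: (N, D); k/q/v_proj: e.g. torch.nn.Linear(D, D)."""
    k, v, q = k_proj(inputs), v_proj(inputs), q_proj(slots)
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=0)  # over slots
    attn = attn / (attn.sum(dim=1, keepdim=True) + eps)        # weighted mean
    return attn @ v                                            # updated slots
```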
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
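A toy dynamic-time-warping alignment between ego and exo embedding sequences illustrates what "aligning videos in time" means operationally; AE2's actual self-supervised training objective differs and is not reproduced here.

```python
# Toy DTW alignment cost between two embedded video sequences.
import numpy as np

def dtw_align(ego, exo):
    """ego: (T1, D), exo: (T2, D) L2-normalized embeddings.
    Returns accumulated alignment cost (lower = better aligned)."""
    cost = 1.0 - ego @ exo.T                    # cosine distance matrix
    T1, T2 = cost.shape
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T1, T2]
```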
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition [5.083140094792973]
SpaCoNet simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
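In the spirit of the co-occurrence cue, the snippet below computes a simple object co-occurrence statistic from semantic segmentation maps; it is an illustration of the idea only, not SpaCoNet's model.

```python
# Count how often pairs of semantic classes appear in the same image.
import numpy as np

def cooccurrence_matrix(seg_maps, n_classes):
    """seg_maps: list of (H, W) label maps with labels in [0, n_classes).
    Returns an (n_classes, n_classes) pairwise co-occurrence count."""
    M = np.zeros((n_classes, n_classes), dtype=np.int64)
    for seg in seg_maps:
        present = np.unique(seg)
        for a in present:
            for b in present:
                M[a, b] += 1
    return M
```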
arXiv Detail & Related papers (2023-05-22T03:04:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.