Online Segment Any 3D Thing as Instance Tracking
- URL: http://arxiv.org/abs/2512.07599v1
- Date: Mon, 08 Dec 2025 14:48:51 GMT
- Title: Online Segment Any 3D Thing as Instance Tracking
- Authors: Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, Zhipeng Zhang,
- Abstract summary: We reconceptualize online 3D segmentation as an instance tracking problem (AutoSeg3D)<n>We introduce spatial consistency learning to mitigate the fragmentation problem inherent in Vision Foundation Models.<n>Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200.
- Score: 60.20416622842975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
Related papers
- ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes [11.119542051581917]
We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS)<n>We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations.<n>To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency.
arXiv Detail & Related papers (2026-01-16T18:45:19Z) - Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection [5.195291754828701]
Collaborative perception allows real-time inter-agent information exchange.<n>limited communication bandwidth in practical scenarios restricts the inter-agent data transmission volume.<n>We propose Which2comm, a novel multi-agent 3D object detection framework leveraging object-level sparse features.
arXiv Detail & Related papers (2025-03-21T14:24:07Z) - SUIT: Learning Significance-guided Information for 3D Temporal Detection [15.237488449422008]
We learn Significance-gUided Information for 3D Temporal detection (SUIT), which simplifies temporal information as sparse features for information fusion across frames.
We evaluate our method on large-scale nuScenes and dataset, where our SUIT not only significantly reduces the memory and cost of temporal fusion, but also performs well over the state-of-the-art baselines.
arXiv Detail & Related papers (2023-07-04T16:22:10Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented,
Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z) - Spatio-temporal Tendency Reasoning for Human Body Pose and Shape
Estimation from Videos [10.50306784245168]
We present atemporal tendency reasoning (STR) network for recovering human body pose shape from videos.
Our STR aims to learn accurate and spatial motion sequences in an unconstrained environment.
Our STR remains competitive with the state-of-the-art on three datasets.
arXiv Detail & Related papers (2022-10-07T16:09:07Z) - AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic temporal-temporal network (DSNet) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance than state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.