ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
- URL: http://arxiv.org/abs/2601.11508v1
- Date: Fri, 16 Jan 2026 18:45:19 GMT
- Title: ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
- Authors: Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni
- Abstract summary: We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS). We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency.
- Score: 11.119542051581917
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
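The abstract defines t-mAP only at a high level: an extension of mAP that rewards temporal identity consistency. Below is a minimal Python sketch of that core idea under stated assumptions: a predicted track counts as a true positive only if it clears the usual IoU threshold in every scan and never switches which ground-truth instance it maps to. All names and the match format are illustrative; the paper's exact t-mAP definition may differ.

```python
from collections import defaultdict

def temporally_consistent_tps(matches, iou_thresh=0.5):
    """Count temporally consistent true positives (illustrative sketch).

    `matches` holds one (scan_id, pred_track_id, gt_track_id, iou) tuple
    per predicted instance per scan. A predicted track is a temporally
    consistent TP only if every per-scan match clears the IoU threshold
    AND the track always maps to the same ground-truth instance.
    """
    per_track = defaultdict(list)
    for scan_id, pred_id, gt_id, iou in matches:
        per_track[pred_id].append((gt_id, iou))

    tps = 0
    for assoc in per_track.values():
        same_identity = len({gt for gt, _ in assoc}) == 1
        iou_ok = all(iou >= iou_thresh for _, iou in assoc)
        if same_identity and iou_ok:  # identity never switches over time
            tps += 1
    return tps

# Example: track "p1" keeps identity "g1" across both scans -> 1 TP;
# track "p2" switches from "g2" to "g3" -> penalized, not a TP.
matches = [
    ("scan0", "p1", "g1", 0.82), ("scan1", "p1", "g1", 0.77),
    ("scan0", "p2", "g2", 0.74), ("scan1", "p2", "g3", 0.71),
]
print(temporally_consistent_tps(matches))  # 1
```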
Related papers
- SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning [11.93789125154006]
We propose a framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized 3D point clouds, using HDBSCAN clustering to generate segmentation proposals (see the sketch after this entry). Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference.
arXiv Detail & Related papers (2025-12-18T12:27:06Z)
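The SNOW entry above names HDBSCAN clustering as the step that turns a synchronized point cloud into segmentation proposals. The following is a minimal sketch of that step using scikit-learn's HDBSCAN implementation; the synthetic point cloud and parameter choices are assumptions for illustration, not SNOW's actual configuration.

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

# Synthetic stand-in for one synchronized 3D scan: two blobs plus clutter.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0, 0.0), scale=0.05, size=(200, 3)),
    rng.normal(loc=(1.0, 0.5, 0.2), scale=0.05, size=(200, 3)),
    rng.uniform(low=-1.0, high=2.0, size=(40, 3)),  # background noise
])

# Density-based clustering: each non-noise label is one instance proposal.
labels = HDBSCAN(min_cluster_size=30).fit_predict(points)  # -1 == noise
proposals = [points[labels == k] for k in set(labels) if k != -1]
print(f"{len(proposals)} segmentation proposals")
```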
- Online Segment Any 3D Thing as Instance Tracking [60.20416622842975]
We reconceptualize online 3D segmentation as an instance tracking problem (AutoSeg3D). We introduce spatial consistency learning to mitigate the fragmentation problem inherent in Vision Foundation Models. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200.
arXiv Detail & Related papers (2025-12-08T14:48:51Z)
- SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis [53.10680153186481]
We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame (see the sketch after this entry). The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.
arXiv Detail & Related papers (2025-10-08T06:39:33Z)
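The SCas4D entry above describes a cascade that refines deformations from coarse part-level to fine point-level. The toy sketch below illustrates only that optimization pattern, fitting per-part translations before per-point offsets; the losses, part assignments, and optimizer settings are assumptions, not SCas4D's actual pipeline.

```python
import torch

def cascaded_fit(points, target, part_ids, num_parts, iters=(50, 50)):
    """Toy coarse-to-fine fit: per-part translations, then per-point offsets.
    (Illustrative pattern only; not SCas4D's actual objective.)"""
    part_t = torch.zeros(num_parts, 3, requires_grad=True)   # coarse stage
    opt = torch.optim.Adam([part_t], lr=0.05)
    for _ in range(iters[0]):
        opt.zero_grad()
        loss = ((points + part_t[part_ids] - target) ** 2).mean()
        loss.backward()
        opt.step()

    coarse = (points + part_t[part_ids]).detach()
    point_d = torch.zeros_like(points, requires_grad=True)   # fine stage
    opt = torch.optim.Adam([point_d], lr=0.01)
    for _ in range(iters[1]):
        opt.zero_grad()
        loss = ((coarse + point_d - target) ** 2).mean()
        loss.backward()
        opt.step()
    return coarse + point_d.detach()

pts = torch.randn(100, 3)
ids = torch.randint(0, 4, (100,))        # hypothetical part assignment
fitted = cascaded_fit(pts, pts + 0.3, ids, num_parts=4)
```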
- Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss [52.28880405119483]
Unsupervised online 3D instance segmentation is a fundamental yet challenging task. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation.
arXiv Detail & Related papers (2025-09-27T08:53:27Z)
- DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS). We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z) - Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
- Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit spatio-temporal information with a concise Graph and Skipped Transformer architecture.
Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model.
Experiments are conducted on the Human3.6M, MPI-INF-3DHP, and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments [20.890476387720483]
MoRE is a novel approach for multi-object relocalization and reconstruction in evolving environments.
We view these environments as "living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances.
arXiv Detail & Related papers (2023-12-14T17:09:57Z)
- Mask4Former: Mask Transformer for 4D Panoptic Segmentation [13.99703660936949]
Mask4Former is the first transformer-based approach unifying semantic instance segmentation and tracking.
Our model directly predicts semantic instances and their temporal associations without relying on hand-crafted non-learned association strategies (see the sketch after this entry).
Mask4Former achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ.
arXiv Detail & Related papers (2023-09-28T03:30:50Z)
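The Mask4Former entry above reports direct prediction of instances together with their temporal associations, with no hand-crafted matching step. The sketch below illustrates the general query-based idea behind such designs, assuming each learned query yields one mask over points from all scans so that the query index itself carries track identity; dimensions and names are illustrative, not Mask4Former's actual architecture.

```python
import torch

num_queries, feat_dim = 8, 16
# Point features from two superimposed scans, each point tagged with its scan.
points_t0 = torch.randn(100, feat_dim)
points_t1 = torch.randn(120, feat_dim)
point_feats = torch.cat([points_t0, points_t1])          # (220, feat_dim)
scan_of_point = torch.cat([torch.zeros(100), torch.ones(120)]).long()

# Each query embedding yields one spatio-temporal mask via dot product.
queries = torch.randn(num_queries, feat_dim)             # learned in practice
mask_logits = point_feats @ queries.T                    # (220, num_queries)
instance_id = mask_logits.argmax(dim=1)                  # per-point track ID

# Because one query spans both scans, points in scan 0 and scan 1 that
# activate the same query share an identity automatically -- no separate
# matching/association step is needed.
for t in (0, 1):
    ids = instance_id[scan_of_point == t].unique().tolist()
    print(f"scan {t}: instance IDs {ids}")
```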
- 4D Panoptic LiDAR Segmentation [27.677435778317054]
We propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points.
Inspired by recent advances in benchmarking of multi-object tracking, we propose to adopt a new evaluation metric that separates the semantic and point-to-instance association aspects of the task (see the sketch after this entry).
arXiv Detail & Related papers (2021-02-24T18:56:16Z)
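The metric proposed in the entry above is LSTQ, the score the Mask4Former entry also reports. The sketch below shows how its two decoupled terms combine; the geometric-mean form matches the published metric, while the inputs are toy values and the full track-level definitions of the association and semantic terms are omitted.

```python
from math import sqrt

def lstq(s_assoc: float, s_cls: float) -> float:
    """LSTQ combines a point-to-instance association score and a semantic
    score via a geometric mean, so a method must do well on both terms;
    the detailed track-level IoU statistics behind each term are omitted.
    """
    return sqrt(s_assoc * s_cls)

print(f"{lstq(0.70, 0.67):.3f}")  # toy inputs, roughly the 68.4 LSTQ scale
```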
- Unsupervised Domain Adaptation with Temporal-Consistent Self-Training for 3D Hand-Object Joint Reconstruction [131.34795312667026]
We introduce an effective approach to addressing this challenge by exploiting 3D geometric constraints within a cycle generative adversarial network (CycleGAN).
In contrast to most existing works, we propose to enforce short- and long-term temporal consistency to fine-tune the domain-adapted model in a self-supervised fashion (see the sketch after this entry).
We demonstrate that our approach outperforms state-of-the-art 3D hand-object joint reconstruction methods on three widely-used benchmarks.
arXiv Detail & Related papers (2020-12-21T11:27:56Z)
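The entry above fine-tunes with short- and long-term temporal consistency but does not spell out the losses here. Below is a minimal sketch of one common way to realize such a self-supervised term, penalizing disagreement between predictions at short and long frame gaps; the loss form and names are assumptions, not the paper's exact formulation.

```python
import torch

def temporal_consistency_loss(preds, short_gap=1, long_gap=5):
    """Sketch of a self-supervised consistency term over a prediction
    sequence `preds` of shape (T, D): nearby frames (short-term) and
    distant frames (long-term) are each encouraged to agree.
    (Illustrative assumption; the paper's exact losses may differ.)"""
    short_term = (preds[short_gap:] - preds[:-short_gap]).pow(2).mean()
    long_term = (preds[long_gap:] - preds[:-long_gap]).pow(2).mean()
    return short_term + long_term

preds = torch.randn(12, 3, requires_grad=True)  # e.g. per-frame pose params
loss = temporal_consistency_loss(preds)
loss.backward()  # gradients would flow into the network during fine-tuning
```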
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.