LONG3R: Long Sequence Streaming 3D Reconstruction
- URL: http://arxiv.org/abs/2507.18255v1
- Date: Thu, 24 Jul 2025 09:55:20 GMT
- Title: LONG3R: Long Sequence Streaming 3D Reconstruction
- Authors: Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, Hang Zhao,
- Abstract summary: Long3R is a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences.<n>Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation.<n>Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences.
- Score: 29.79885827038617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.
Related papers
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control.<n>We present RELIC, a unified framework that tackles these three challenges altogether.<n>Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - Cross-Temporal 3D Gaussian Splatting for Sparse-View Guided Scene Update [17.581193784542357]
Updating 3D scenes from sparse-view observations is crucial for various real-world applications.<n>We propose Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a novel framework for efficiently reconstructing and updating 3D scenes.<n> Experimental results show significant improvements over baseline methods in reconstruction quality and data efficiency.
arXiv Detail & Related papers (2025-11-29T16:00:24Z) - STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer [72.88105562624838]
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem.<n>By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios.<n>Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
arXiv Detail & Related papers (2025-08-14T17:58:05Z) - mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling [0.5236468296934584]
mGRADE is a hybrid-memory system that integrates a temporal 1D-convolution with learnable spacings followed by a minimal gated recurrent unit.<n>We demonstrate that mGRADE effectively separates and preserves multi-scale temporal features.<n>This highlights mGRADE's promise as an efficient solution for memory-constrained multi-scale temporal processing at the edge.
arXiv Detail & Related papers (2025-07-02T15:44:35Z) - Long-Sequence Memory with Temporal Kernels and Dense Hopfield Functionals [0.0]
Building upon earlier work on long-sequence Hopfield memory models, we propose a temporal kernal $K(m, k)$ to incorporate temporal dependencies.<n>We demonstrate the successful application of this technique for the storage and sequential retrieval of movies frames.
arXiv Detail & Related papers (2025-06-27T15:57:58Z) - LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering [68.93333348474988]
We present a novel level-of-detail (LOD) method for 3D Gaussian Splatting on memory-constrained devices.<n>Our approach iteratively selects optimal subsets of Gaussians based on camera distance.<n>Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets.
arXiv Detail & Related papers (2025-05-29T06:50:57Z) - Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes.<n>Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements.<n>We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing.<n>Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z) - VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment [54.66217340264935]
VideoLifter is a novel video-to-3D pipeline that leverages a local-to-global strategy on a fragment basis.<n>It significantly accelerates the reconstruction process, reducing training time by over 82% while holding better visual quality than current SOTA methods.
arXiv Detail & Related papers (2025-01-03T18:52:36Z) - MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long)
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5.4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.