OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
- URL: http://arxiv.org/abs/2603.02134v2
- Date: Tue, 03 Mar 2026 14:03:57 GMT
- Title: OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
- Authors: Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
- Abstract summary: We introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses the information from the former into the latter to achieve both fidelity and stability.
- Score: 34.8105632078785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm and lack the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in the online formulation is cumulative drift, which is rooted in a fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm: the memory state is split into a dedicated active state and a persistent stable state, and information from the former is cohesively fused into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, with robust performance across input sequences of varying lengths and real-time inference speed.
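The abstract gives no pseudocode, so the following is a minimal Python/NumPy sketch of what a decoupled active-to-stable update could look like: the active state is fully refreshed from each frame, while a learned gate writes only part of it into the persistent stable state. The dimensions, the encoder stand-in, and the gating rule are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                     # latent state width (assumed)
W_enc = rng.normal(scale=0.1, size=(D, D))
W_gate = rng.normal(scale=0.1, size=(2 * D, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(frame_feat):
    """Stand-in for the feed-forward image encoder (hypothetical)."""
    return np.tanh(frame_feat @ W_enc)

active = np.zeros(D)   # refreshed every frame: high-frequency local geometry
stable = np.zeros(D)   # conservatively accumulated long-term global structure

for t in range(100):                       # streaming inputs
    frame_feat = rng.normal(size=D)        # placeholder for per-frame features
    active = encode(frame_feat)            # active state is fully refreshed
    gate = sigmoid(np.concatenate([active, stable]) @ W_gate)
    # Gated active-to-stable fusion: only a learned fraction of the fresh
    # information is written into the persistent state, limiting drift.
    stable = gate * active + (1.0 - gate) * stable
```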
Related papers
- StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation [57.06461272772509]
StdGEN++ is a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. It achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. The resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking.
arXiv Detail & Related papers (2026-01-12T15:41:27Z)
- RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion [21.761449995572757]
RecurGS is a recurrent fusion framework that integrates discrete Gaussian scene states into a single evolving representation. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed. Our framework delivers high-quality reconstructions with substantially improved update efficiency.
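To make the selective-write idea concrete, here is a toy sketch of visibility-aware voxel fusion that leaves already-integrated voxels untouched. The grid size, feature layout, and visibility test are assumptions, not RecurGS's actual module.

```python
import numpy as np

rng = np.random.default_rng(1)
G, D = 16, 8                               # grid resolution, feature width
grid = np.zeros((G, G, G, D))              # fused scene representation
observed = np.zeros((G, G, G), dtype=bool) # voxels already integrated

def fuse_view(view_feat, visible_mask):
    """Incorporate a new observation, keeping stable regions fixed."""
    new = visible_mask & ~observed         # newly observed voxels only
    grid[new] = view_feat[new]             # write once, then freeze
    observed[new] = True

visible = rng.random((G, G, G)) < 0.2      # stand-in visibility from one view
features = rng.normal(size=(G, G, G, D))
fuse_view(features, visible)
```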
arXiv Detail & Related papers (2025-12-20T14:53:22Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction [26.278318116942526]
We present UniSplat, a feed-forward framework that learns robust dynamic scene reconstruction through unified spatio-temporal fusion over 3D latent scaffolds. Experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust, high-quality renderings for viewpoints outside the original camera coverage.
arXiv Detail & Related papers (2025-11-06T17:49:39Z)
- LUMA-RAG: Lifelong Multimodal Agents with Provably Stable Streaming Alignment [0.0]
Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language model outputs in verifiable evidence. We present LUMA-RAG, a lifelong multimodal agent architecture featuring three key innovations. Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94), graceful performance degradation under product quantization offloading, and provably stable audio-to-image rankings (Safe@1 = 1.0).
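For reference, Recall@k as reported above is the fraction of queries whose ground-truth item appears among the top-k retrieved candidates; a minimal implementation on toy data:

```python
import numpy as np

def recall_at_k(scores, gt_index, k=10):
    """scores: (num_queries, num_items) similarity matrix."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of top-k items
    hits = (topk == gt_index[:, None]).any(axis=1)   # ground truth in top-k?
    return hits.mean()

rng = np.random.default_rng(2)
scores = rng.normal(size=(100, 500))                 # toy similarity scores
gt = rng.integers(0, 500, size=100)                  # toy ground-truth indices
print(recall_at_k(scores, gt, k=10))
```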
arXiv Detail & Related papers (2025-11-04T08:47:12Z)
- OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects [58.38338242973447]
OnlineSplatter is a novel framework generating high-quality, object-centric 3D Gaussians directly from RGB frames. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys.
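As a rough illustration of a dual-key memory read (the dimensions, similarity choices, and mixing weight are assumptions, not the paper's design), a query can be scored against both key sets and the scores combined into one attention readout:

```python
import numpy as np

rng = np.random.default_rng(3)
M, Dk = 256, 32                                 # memory slots, latent key width
latent_keys = rng.normal(size=(M, Dk))          # latent appearance-geometry keys
dir_keys = rng.normal(size=(M, 3))              # explicit directional keys
dir_keys /= np.linalg.norm(dir_keys, axis=1, keepdims=True)
values = rng.normal(size=(M, Dk))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read(query_latent, query_dir, alpha=0.5):
    s_lat = latent_keys @ query_latent / np.sqrt(Dk)   # latent similarity
    s_dir = dir_keys @ query_dir                       # directional similarity
    w = softmax(alpha * s_lat + (1 - alpha) * s_dir)   # combined attention
    return w @ values                                  # attended readout

out = read(rng.normal(size=Dk), np.array([0.0, 0.0, 1.0]))
```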
arXiv Detail & Related papers (2025-10-23T14:37:25Z)
- PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation [18.771349697842947]
This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. Our work posits that PRGCN establishes a new state-of-the-art, achieving MPJPEs of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability.
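A minimal sketch of attention-based retrieval over a bank of stored prototypes (single-head, toy sizes, flat feature vectors in place of relational graphs; the paper's actual mechanism may differ):

```python
import numpy as np

rng = np.random.default_rng(4)
P, D = 64, 128                         # number of prototypes, feature width
prototypes = rng.normal(size=(P, D))   # learned, stored pose prototypes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def retrieve_prior(query):
    attn = softmax(prototypes @ query / np.sqrt(D))  # attention over the bank
    return attn @ prototypes                         # weighted structured prior

prior = retrieve_prior(rng.normal(size=D))
```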
arXiv Detail & Related papers (2025-10-22T11:12:07Z)
- Puppeteer: Rig and Animate Your 3D Models [105.11046762553121]
Puppeteer is a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer. It then infers skinning weights via an attention-based architecture.
arXiv Detail & Related papers (2025-08-14T17:59:31Z)
- VEIGAR: View-consistent Explicit Inpainting and Geometry Alignment for 3D object Removal [2.8954284913103367]
Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks. To maintain cross-view consistency throughout the generative process, methods typically address this challenge using a dual-strategy framework. We present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase.
arXiv Detail & Related papers (2025-06-13T11:31:44Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
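A toy version of this per-timestamp loop, with an assumed fixed-size memory bank and single-head attention (not the paper's architecture): only the current frame's features are computed, they interact with the stored history, and the bank is then rolled forward.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(5)
D, BANK = 32, 8
bank = deque(maxlen=BANK)              # multi-frame historical features

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(frame_feat):
    if bank:
        mem = np.stack(bank)                       # (n_frames, D)
        attn = softmax(mem @ frame_feat / np.sqrt(D))
        frame_feat = frame_feat + attn @ mem       # interact with history
    bank.append(frame_feat)                        # roll the memory forward
    return frame_feat

for t in range(20):                                # one frame per timestamp
    step(rng.normal(size=D))
```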
arXiv Detail & Related papers (2023-03-14T02:58:27Z)