PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
- URL: http://arxiv.org/abs/2510.17568v2
- Date: Tue, 21 Oct 2025 18:59:28 GMT
- Title: PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
- Authors: Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang
- Abstract summary: PAGE-4D is a feedforward model that extends VGGT to dynamic scenes without post-processing. It disentangles static and dynamic information by predicting a dynamics-aware mask. Experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction -- all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask -- suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
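The abstract's central idea, a dynamics-aware mask that suppresses motion cues for pose estimation while amplifying them for geometry reconstruction, can be illustrated with a minimal sketch. This is not the paper's actual aggregator architecture; the gating scheme (sigmoid mask, multiplicative down-/up-weighting) and all names here are illustrative assumptions.

```python
import numpy as np

def dynamics_aware_split(tokens: np.ndarray, mask_logits: np.ndarray):
    """Illustrative gating of per-patch features by a predicted dynamics mask.

    tokens:      (N, D) per-patch features from a shared backbone
    mask_logits: (N,)   raw scores; sigmoid gives the probability a patch is dynamic

    Returns (pose_feats, geom_feats): dynamic regions are down-weighted for the
    pose branch and emphasised for the geometry branch, resolving the conflict
    between the two tasks described in the abstract.
    """
    m = 1.0 / (1.0 + np.exp(-mask_logits))    # dynamics probability in [0, 1]
    pose_feats = tokens * (1.0 - m)[:, None]  # suppress moving regions for pose
    geom_feats = tokens * (1.0 + m)[:, None]  # amplify moving regions for geometry
    return pose_feats, geom_feats

# Two clearly static patches keep their features for pose estimation,
# while a clearly dynamic patch is boosted for geometry reconstruction.
tokens = np.ones((3, 4))
pose_f, geom_f = dynamics_aware_split(tokens, np.array([-10.0, -10.0, 10.0]))
```

In a real model the mask would be predicted by a learned head inside the transformer aggregator rather than supplied externally; the multiplicative gating is just one simple way to realise "suppress for pose, amplify for geometry".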
Related papers
- Dynamic Visual SLAM using a General 3D Prior [27.487134809452147]
We propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions.
arXiv Detail & Related papers (2025-12-07T14:44:06Z) - VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction [15.933288728509337]
VGGT4D is a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer-wise dynamic cues. Our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction.
arXiv Detail & Related papers (2025-11-25T06:30:22Z) - 4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos [52.89084603734664]
We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our approach achieves up to 1.8 dB PSNR improvement over state-of-the-art methods.
arXiv Detail & Related papers (2025-11-07T13:25:50Z) - DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss [5.644194272935956]
DynaPose4D is a framework that generates high-quality 4D dynamic content from a single static image. Results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation.
arXiv Detail & Related papers (2025-10-26T01:11:13Z) - C4D: 4D Made from 3D through Dual Correspondences [77.04731692213663]
We introduce C4D, a framework that leverages temporal correspondences to extend existing 3D reconstruction formulation to 4D. C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information.
arXiv Detail & Related papers (2025-10-16T17:59:06Z) - D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes [40.371542172080105]
We propose D2USt3R, which regresses 4D pointmaps that simultaneously capture both static and dynamic 3D scene geometry in a feed-forward manner. By explicitly incorporating both spatial and temporal aspects, our approach successfully attaches spatio-temporal dense correspondence to the proposed 4D pointmaps, enhancing downstream tasks.
arXiv Detail & Related papers (2025-04-08T17:59:50Z) - Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [69.51086319339662]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z) - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.