TrajVG: 3D Trajectory-Coupled Visual Geometry Learning
- URL: http://arxiv.org/abs/2602.04439v2
- Date: Thu, 05 Feb 2026 02:51:25 GMT
- Title: TrajVG: 3D Trajectory-Coupled Visual Geometry Learning
- Authors: Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, Junting Dong
- Abstract summary: TrajVG is a reconstruction framework that makes cross-frame 3D correspondence explicit. Its coupling constraints are reformulated into self-supervised objectives using only pseudo 2D tracks. Experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
- Score: 34.47385438436523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. A global reference frame becomes ambiguous under multiple motions, while local pointmaps rely heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local pointmaps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
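To make objectives (i) and (ii) concrete, below is a minimal PyTorch sketch of the two coupling losses as we read them from the abstract. All tensor names, shapes, the stop-gradient placement, and the static-anchor construction are our own illustrative assumptions, not TrajVG's released implementation.

```python
# Minimal sketch of the two coupling objectives described in the abstract.
# All names, shapes, and weighting choices are illustrative assumptions.
import torch
import torch.nn.functional as F


def trajectory_pointmap_consistency(traj_cam, pointmaps, tracks_2d, vis):
    """Objective (i): bidirectional trajectory-pointmap consistency.

    traj_cam:  (T, N, 3) predicted camera-coordinate 3D trajectories.
    pointmaps: (T, H, W, 3) per-frame local pointmaps (camera coordinates).
    tracks_2d: (T, N, 2) track pixel locations, normalized to [-1, 1].
    vis:       (T, N) per-point visibility mask.
    """
    T, N, _ = traj_cam.shape
    # Sample the pointmap at each track's pixel location (bilinear).
    grid = tracks_2d.view(T, N, 1, 2)                       # (T, N, 1, 2)
    pm = pointmaps.permute(0, 3, 1, 2)                      # (T, 3, H, W)
    sampled = F.grid_sample(pm, grid, align_corners=False)  # (T, 3, N, 1)
    sampled = sampled.squeeze(-1).permute(0, 2, 1)          # (T, N, 3)

    # "Controlled gradient flow" read as a stop-gradient pattern: each
    # direction supervises only one branch (an assumption on our part).
    loss_traj_to_pm = (vis[..., None] * (traj_cam.detach() - sampled).abs()).mean()
    loss_pm_to_traj = (vis[..., None] * (traj_cam - sampled.detach()).abs()).mean()
    return loss_traj_to_pm + loss_pm_to_traj


def static_anchor_pose_consistency(traj_cam, rel_poses, static_mask, vis):
    """Objective (ii): pose consistency driven by static track anchors.

    rel_poses:   (T, 4, 4) predicted relative poses (frame t -> frame 0).
    static_mask: (N,) 1 for tracks judged static, 0 for dynamic; masking
                 dynamic tracks out suppresses their gradients entirely.
    """
    T, N, _ = traj_cam.shape
    ones = torch.ones_like(traj_cam[..., :1])
    homog = torch.cat([traj_cam, ones], dim=-1)                  # (T, N, 4)
    # Transport every frame's track points into the reference frame.
    world = torch.einsum('tij,tnj->tni', rel_poses, homog)[..., :3]
    # A static point should land at the same reference-frame location in
    # every frame; the per-track mean is one possible anchor choice.
    anchor = world.mean(dim=0, keepdim=True)                     # (1, N, 3)
    w = static_mask[None, :, None] * vis[..., None]
    return (w * (world - anchor).abs()).mean()
```

The `.detach()` calls realize one plausible reading of "controlled gradient flow", and the static mask zeroes the pose term's gradients from dynamic regions; the paper may implement both differently.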
Related papers
- Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment [15.822150318879052]
We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment. We train a lightweight feature adapter using a reprojection-based consistency loss (a minimal sketch of such a loss appears after this list). This enables state-of-the-art performance in both NVS and camera pose estimation.
arXiv Detail & Related papers (2025-12-09T18:59:52Z) - C4D: 4D Made from 3D through Dual Correspondences [77.04731692213663]
We introduce C4D, a framework that leverages temporal correspondences to extend existing 3D reconstruction formulations to 4D. C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information.
arXiv Detail & Related papers (2025-10-16T17:59:06Z) - SpatialTrackerV2: 3D Point Tracking Made Easy [73.0350898700048]
SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion (a sketch of this decomposition also follows the list). By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%.
arXiv Detail & Related papers (2025-07-16T17:59:03Z) - St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World [106.91539872943864]
St4RTrack is a framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. We predict pointmaps for both frames at the same moment, in the same world coordinate frame, capturing both static and dynamic scene geometry. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
arXiv Detail & Related papers (2025-04-17T17:55:58Z) - POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction [53.19968902152528]
We present POMATO, a unified framework for dynamic 3D reconstruction that marries pointmap matching with temporal motion. Specifically, our method learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm through strong performance across multiple downstream tasks.
arXiv Detail & Related papers (2025-04-08T05:33:13Z) - PoseTraj: Pose-Aware Trajectory Control in Video Diffusion [17.0187150041712]
We introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories.
arXiv Detail & Related papers (2025-03-20T12:01:43Z) - T-3DGS: Removing Transient Objects for 3D Scene Reconstruction [83.05271859398779]
Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. We propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting.
arXiv Detail & Related papers (2024-11-29T07:45:24Z) - Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present a multi-view 3D pose estimation approach based on plane sweep stereo that jointly addresses cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.