Geometry-Aware Rotary Position Embedding for Consistent Video World Model
- URL: http://arxiv.org/abs/2602.07854v2
- Date: Tue, 17 Feb 2026 06:42:24 GMT
- Title: Geometry-Aware Rotary Position Embedding for Consistent Video World Model
- Authors: Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu
- Abstract summary: ViewRope is a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. Geometry-Aware Frame-Sparse Attention exploits these geometric cues to selectively attend to relevant historical frames. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
- Score: 48.914346802616414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce \textbf{ViewRope}, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose \textbf{Geometry-Aware Frame-Sparse Attention}, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present \textbf{ViewBench}, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
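The abstract's core idea, replacing screen-space positional embeddings with an encoding derived from camera-ray directions, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the azimuth/elevation parameterization of rays, and the frequency schedule (borrowed from standard RoPE) are all assumptions.

```python
import numpy as np

def camera_rays(K, R, H, W):
    """Unit ray direction in world coordinates for each pixel.
    K: 3x3 camera intrinsics; R: 3x3 world-from-camera rotation."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel centers, shape (H, W, 3).
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs = pix @ np.linalg.inv(K).T   # back-project into the camera frame
    dirs = dirs @ R.T                 # rotate into the world frame
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def ray_rotary_encoding(dirs, dim):
    """Rotary-style features built from ray azimuth/elevation instead of
    pixel coordinates (a hypothetical parameterization; the paper's exact
    scheme may differ). Returns shape (..., dim)."""
    az = np.arctan2(dirs[..., 0], dirs[..., 2])
    el = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    nfreq = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(nfreq) / nfreq))
    ang = np.concatenate([az[..., None] * freqs, el[..., None] * freqs], axis=-1)
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=-1)
```

Because the encoding depends only on where a ray points in the world, two pixels observing the same 3D direction from different frames receive matching features, which is the inductive bias the abstract attributes to ViewRope.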
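The second component, Geometry-Aware Frame-Sparse Attention, restricts attention to historical frames that are geometrically relevant to the current view. A minimal stand-in for that selection step is shown below; using cosine similarity of mean ray directions as the relevance score is an assumption for illustration, not the paper's actual criterion.

```python
import numpy as np

def select_relevant_frames(hist_dirs, cur_dirs, k):
    """Pick the k history frames whose viewing directions overlap most
    with the current frame's.
    hist_dirs: (F, H, W, 3) per-pixel unit ray directions for F past frames.
    cur_dirs:  (H, W, 3) ray directions for the current frame.
    Returns indices of the k most relevant frames, best first."""
    hist_mean = hist_dirs.mean(axis=(1, 2))
    hist_mean /= np.linalg.norm(hist_mean, axis=-1, keepdims=True)
    cur_mean = cur_dirs.mean(axis=(0, 1))
    cur_mean /= np.linalg.norm(cur_mean)
    scores = hist_mean @ cur_mean          # cosine similarity per frame
    return np.argsort(-scores)[:k]
```

Attention is then computed only over the selected frames' tokens, which is how such a scheme could reduce cost while still retrieving content from frames that saw the same part of the scene.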
Related papers
- Geometry OR Tracker: Universal Geometric Operating Room Tracking [61.399734016038614]
In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition. Camera calibration and RGB-D registration are often unreliable, leading to cross-view geometric inconsistency. We introduce Geometry OR Tracker, a two-stage pipeline that rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup.
arXiv Detail & Related papers (2026-02-28T09:21:21Z)
- DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations. We propose the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z)
- Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency [23.121660279216528]
Grab-3D is a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric head to explicitly inject 3D geometric awareness into temporal modeling.
arXiv Detail & Related papers (2025-12-15T18:54:30Z)
- ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points [32.23473666846317]
We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines.
arXiv Detail & Related papers (2025-12-08T12:38:11Z)
- Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers [5.025261312338861]
LAPA (Look Around and Pay Attention) is a novel end-to-end transformer-based architecture for multi-camera point tracking. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation. Experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods.
arXiv Detail & Related papers (2025-12-03T19:34:08Z)
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction [57.46712611558817]
3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. Recent strategies align consecutive predictions by solving a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. We propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies.
arXiv Detail & Related papers (2025-12-02T02:22:20Z)
- 4D Driving Scene Generation With Stereo Forcing [62.47705572424127]
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
arXiv Detail & Related papers (2025-09-24T15:37:17Z)
- GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent "local reconstruction and rendering" paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z)
- Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection [79.98605061363999]
Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. Existing methods are constrained by the inherent ambiguity of single-frame input. We propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves.
arXiv Detail & Related papers (2025-04-29T08:10:17Z)
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors [47.21120442961684]
We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos. We show that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
arXiv Detail & Related papers (2025-04-01T17:58:03Z)
- Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for Consistent Self-Supervised Monocular Depth Estimation [42.249533907879126]
This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency.
We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information.
A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images.
arXiv Detail & Related papers (2021-10-15T16:43:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.