Geometry-Aware Rotary Position Embedding for Consistent Video World Model
- URL: http://arxiv.org/abs/2602.07854v2
- Date: Tue, 17 Feb 2026 06:42:24 GMT
- Title: Geometry-Aware Rotary Position Embedding for Consistent Video World Model
- Authors: Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu
- Abstract summary: ViewRope is a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. Geometry-Aware Frame-Sparse Attention exploits these geometric cues to selectively attend to relevant historical frames. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
- Score: 48.914346802616414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce \textbf{ViewRope}, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose \textbf{Geometry-Aware Frame-Sparse Attention}, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present \textbf{ViewBench}, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
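The abstract's core idea, replacing screen-space positional embeddings with an encoding derived from camera-ray directions, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the azimuth/elevation parameterization of rays, and the frequency schedule (borrowed from standard RoPE) are all assumptions.

```python
import numpy as np

def camera_rays(K, R, H, W):
    """Unit ray direction in world coordinates for each pixel.
    K: 3x3 camera intrinsics; R: 3x3 world-from-camera rotation."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel centers, shape (H, W, 3).
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs = pix @ np.linalg.inv(K).T   # back-project into the camera frame
    dirs = dirs @ R.T                 # rotate into the world frame
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def ray_rotary_encoding(dirs, dim):
    """Rotary-style features built from ray azimuth/elevation instead of
    pixel coordinates (a hypothetical parameterization; the paper's exact
    scheme may differ). Returns shape (..., dim)."""
    az = np.arctan2(dirs[..., 0], dirs[..., 2])
    el = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    nfreq = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(nfreq) / nfreq))
    ang = np.concatenate([az[..., None] * freqs, el[..., None] * freqs], axis=-1)
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=-1)
```

Because the encoding depends only on where a ray points in the world, two pixels observing the same 3D direction from different frames receive matching features, which is the inductive bias the abstract attributes to ViewRope.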
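The second component, Geometry-Aware Frame-Sparse Attention, restricts attention to historical frames that are geometrically relevant to the current view. A minimal stand-in for that selection step is shown below; using cosine similarity of mean ray directions as the relevance score is an assumption for illustration, not the paper's actual criterion.

```python
import numpy as np

def select_relevant_frames(hist_dirs, cur_dirs, k):
    """Pick the k history frames whose viewing directions overlap most
    with the current frame's.
    hist_dirs: (F, H, W, 3) per-pixel unit ray directions for F past frames.
    cur_dirs:  (H, W, 3) ray directions for the current frame.
    Returns indices of the k most relevant frames, best first."""
    hist_mean = hist_dirs.mean(axis=(1, 2))
    hist_mean /= np.linalg.norm(hist_mean, axis=-1, keepdims=True)
    cur_mean = cur_dirs.mean(axis=(0, 1))
    cur_mean /= np.linalg.norm(cur_mean)
    scores = hist_mean @ cur_mean          # cosine similarity per frame
    return np.argsort(-scores)[:k]
```

Attention is then computed only over the selected frames' tokens, which is how such a scheme could reduce cost while still retrieving content from frames that saw the same part of the scene.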
Related papers
- Geometry OR Tracker: Universal Geometric Operating Room Tracking [61.399734016038614]
In operating rooms (OR), world-scale multi-view 3D tracking supports downstream applications such as surgeon behavior recognition. Camera calibration and RGB-D registration are often unreliable, leading to cross-view geometric inconsistency. We introduce Geometry OR Tracker, a two-stage pipeline that rectifies imprecise calibration into a scale-consistent and geometrically consistent camera setup.
arXiv Detail & Related papers (2026-02-28T09:21:21Z)
- DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations. We propose the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z)
- Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency [23.121660279216528]
Grab-3D is a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric head to explicitly inject 3D geometric awareness into temporal modeling.
arXiv Detail & Related papers (2025-12-15T18:54:30Z)
- ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points [32.23473666846317]
We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines.
arXiv Detail & Related papers (2025-12-08T12:38:11Z)
- Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers [5.025261312338861]
LAPA (Look Around and Pay Attention) is a novel end-to-end transformer-based architecture for multi-camera point tracking. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation. Experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods.
arXiv Detail & Related papers (2025-12-03T19:34:08Z)
- GeoVideo: Introducing Geometric Regularization into Video Generation Model [46.38507581500745]
We introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved structural coherence, temporal shape consistency, and physical plausibility.
arXiv Detail & Related papers (2025-12-03T05:11:57Z)
- TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction [57.46712611558817]
3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. Recent strategies align consecutive predictions by solving a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. We propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies.
arXiv Detail & Related papers (2025-12-02T02:22:20Z)
- 4D Driving Scene Generation With Stereo Forcing [62.47705572424127]
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
arXiv Detail & Related papers (2025-09-24T15:37:17Z)
- GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent "local reconstruction and rendering" paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z)
- Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection [79.98605061363999]
Monocular 3D lane detection aims to estimate the 3D position of lanes from frontal-view (FV) images. Existing methods are constrained by the inherent ambiguity of single-frame input. We propose to unlock the rich information embedded in the temporal evolution of the scene as the vehicle moves.
arXiv Detail & Related papers (2025-04-29T08:10:17Z)
- GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors [47.21120442961684]
We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos. We show that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
arXiv Detail & Related papers (2025-04-01T17:58:03Z)
- Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for Consistent Self-Supervised Monocular Depth Estimation [42.249533907879126]
This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency.
We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information.
A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images.
arXiv Detail & Related papers (2021-10-15T16:43:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.