RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
- URL: http://arxiv.org/abs/2602.20685v2
- Date: Wed, 25 Feb 2026 05:17:17 GMT
- Title: RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
- Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan
- Abstract summary: RAYNOVA is a multi-view world model for driving scenarios that employs a dual-causal autoregressive framework. It constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding.
- Score: 51.441415833480505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multi-view world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Unlike existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without an explicit 3D scene representation. Our code will be released at https://raynova-ai.github.io/.
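For readers unfamiliar with the ray parameterization the abstract refers to, below is a minimal sketch of the standard per-pixel Plücker coordinates (d, o × d) that ray-based positional encodings build on. This is our illustration, not the authors' released code; the function name and conventions are assumptions, and RAYNOVA's *relative* variant additionally re-expresses rays with respect to one another rather than in an absolute world frame, per the abstract.

```python
# Minimal sketch (not the authors' code): per-pixel Pluecker-ray map
# (d, o x d) for a pinhole camera. Exact conventions here are assumptions.
import torch

def pluecker_ray_map(K: torch.Tensor, cam2world: torch.Tensor,
                     height: int, width: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; cam2world: (4, 4) camera-to-world pose.
    Returns a (height, width, 6) map of Pluecker coordinates."""
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                 # unproject: K^-1 [u, v, 1]^T
    R, o = cam2world[:3, :3], cam2world[:3, 3]
    d = dirs_cam @ R.T                                     # rotate directions to world frame
    d = d / d.norm(dim=-1, keepdim=True)                   # unit ray directions
    m = torch.cross(o.expand_as(d), d, dim=-1)             # moment m = o x d
    return torch.cat([d, m], dim=-1)                       # (H, W, 6) ray embedding input
```

Because every view, frame, and scale can be tagged with such ray maps, attention over the resulting tokens needs no explicit 3D scene representation, which is consistent with the generalization claims above.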
Related papers
- Beyond Pixel Histories: World Models with Persistent 3D State [50.4601060508243]
PERSIST is a new world-model paradigm that simulates the evolution of a latent 3D scene. We show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods.
arXiv Detail & Related papers (2026-03-03T19:58:31Z)
- StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation [6.0744834626758495]
StemVLA is a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D representations into action prediction. We show that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], with an average sequence length of XXX.
arXiv Detail & Related papers (2026-02-27T06:43:37Z)
- TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model [53.555353366322464]
We present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible synthesis systems.
arXiv Detail & Related papers (2025-12-31T18:31:46Z)
- Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement [6.91111219679588]
Blur2Sharp is a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images. Our method employs a dual-conditioning architecture: first, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy.
arXiv Detail & Related papers (2025-12-09T03:49:12Z)
- UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
- OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting [78.70702961852119]
OracleGS reconciles generative completeness with regressive fidelity for sparse-view Gaussian Splatting. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions.
arXiv Detail & Related papers (2025-09-27T11:19:32Z)
- Epona: Autoregressive Diffusion World Model for Autonomous Driving [39.389981627403316]
Existing video diffusion models struggle with flexible-length, long-horizon prediction and with integrating trajectory planning, because they rely on global joint distribution modeling of fixed-length frame sequences. We propose Epona, an autoregressive world model that enables localized distribution modeling.
arXiv Detail & Related papers (2025-06-30T17:56:35Z)
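A one-line way to read the contrast drawn in the Epona entry above (our gloss; the notation is not from the paper): a conventional video diffusion model learns the global joint p_θ(x_{1:T}) of a fixed-length clip x_{1:T}, whereas an autoregressive world model factorizes it as p(x_{1:T}) = ∏_{t=1}^{T} p_θ(x_t | x_{<t}), so each step models only a localized conditional and the horizon T can grow freely at inference time.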
- DSG-World: Learning a 3D Gaussian World Model from Dual State Videos [14.213608866611784]
We present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian world model from dual-state observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency.
arXiv Detail & Related papers (2025-06-05T16:33:32Z)
- Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. At generation time, our video diffusion model predicts RGB-D video of the agent's future observations. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
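The aggregation step described in the entry above (generated RGB-D frames fused into a persistent 3D map) can be illustrated with standard pinhole back-projection. The sketch below is a generic illustration under assumed conventions (metric depth, camera-to-world poses), not this paper's implementation; a real system would also fuse or downsample the accumulated points.

```python
# Generic sketch, not this paper's code: back-project one RGB-D frame into
# world-frame points and append it to a running "persistent map" point list.
import torch

def fuse_rgbd_into_map(rgb, depth, K, cam2world, map_pts, map_rgb):
    """rgb: (H, W, 3); depth: (H, W) metric depth; K: (3, 3) intrinsics;
    cam2world: (4, 4) pose; map_pts / map_rgb: lists accumulating the map."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth                    # pinhole back-projection
    y = (v - cy) / fy * depth
    pts_cam = torch.stack([x, y, depth], dim=-1)
    R, t = cam2world[:3, :3], cam2world[:3, 3]
    pts_world = pts_cam @ R.T + t                # camera -> world frame
    valid = depth > 0                            # drop holes / invalid depth
    map_pts.append(pts_world[valid])
    map_rgb.append(rgb[valid])
```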
- Seeing World Dynamics in a Nutshell [132.79736435144403]
NutWorld is a framework that transforms monocular videos into dynamic 3D representations in a single forward pass. We demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling downstream applications in real time.
arXiv Detail & Related papers (2025-02-05T18:59:52Z)