UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models
- URL: http://arxiv.org/abs/2602.22960v1
- Date: Thu, 26 Feb 2026 12:54:46 GMT
- Title: UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models
- Authors: Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang
- Abstract summary: We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
- Score: 54.564740558030245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
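The abstract names the core mechanism but not its details. Below is a minimal, illustrative sketch of one plausible reading of time-aware positional encoding warping: re-project a past frame's token encodings into the current camera using per-token depth and the relative pose, then attenuate them by temporal distance. Everything here is an assumption rather than the paper's implementation: the function name, the nearest-pixel splatting, the availability of depth and poses, and the exponential decay with scale `tau`.

```python
import torch

def warp_positional_encoding(pe_grid, depth, K, T_past_to_cur, t_gap, tau=16.0):
    """Hypothetical sketch: warp a past frame's positional-encoding grid into
    the current camera frame and attenuate it by temporal distance.

    pe_grid:       (H, W, C) positional encodings of the past frame's tokens
    depth:         (H, W) per-token depth estimates for the past frame
    K:             (3, 3) camera intrinsics
    T_past_to_cur: (4, 4) relative camera pose (past -> current)
    t_gap:         number of frames between the past frame and the current one
    """
    H, W, C = pe_grid.shape
    # Build the pixel grid of the past frame and unproject it to 3D points.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()      # (H, W, 3)
    pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T).T * depth.reshape(-1, 1)

    # Transform the points into the current camera and project them back.
    pts_h = torch.cat([pts, torch.ones(H * W, 1)], dim=-1)             # (HW, 4)
    pts_cur = (T_past_to_cur @ pts_h.T).T[:, :3]
    proj = (K @ pts_cur.T).T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                    # (HW, 2)

    # Splat each past encoding to its re-projected nearest pixel, keeping
    # only points that land inside the current frame with positive depth.
    warped = torch.zeros_like(pe_grid)
    u_new, v_new = uv[:, 0].round().long(), uv[:, 1].round().long()
    valid = (u_new >= 0) & (u_new < W) & (v_new >= 0) & (v_new < H) & (pts_cur[:, 2] > 0)
    warped[v_new[valid], u_new[valid]] = pe_grid.reshape(-1, C)[valid]

    # "Time-aware": older memories contribute weaker positional evidence.
    return warped * torch.exp(torch.tensor(-t_gap / tau))
```

The same unproject-transform-reproject machinery also suggests how the point-cloud-based data curation could simulate scene revisiting: lift an earlier frame to a point cloud, render it from a later camera along the trajectory, and pair that rendering with the later ground-truth frame as a (condition, target) training example.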
Related papers
- DCDM: Divide-and-Conquer Diffusion Models for Consistency-Preserving Video Generation [77.89090846233906]
We propose a system-level framework, termed the Divide-and-Conquer Diffusion Model (DCDM). DCDM decomposes video consistency modeling into three dedicated components while sharing a unified video generation backbone. We validate our framework on the test set of the CVM Competition at AAAI'26, and the results demonstrate that the proposed strategies effectively address these challenges.
arXiv Detail & Related papers (2026-02-14T07:02:36Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales. We present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams. Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at out-of-distribution (OOD) scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z) - DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving [9.882070476776274]
We present DriveCamSim, a generalizable camera simulation framework. Our core innovation lies in the proposed Explicit Camera Modeling mechanism. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines.
arXiv Detail & Related papers (2025-05-26T08:50:15Z) - Towards Efficient Real-Time Video Motion Transfer via Generative Time Series Modeling [7.3949576464066]
We propose a deep learning framework designed to significantly optimize bandwidth for motion-transfer-enabled video applications. To capture complex motion effectively, we utilize the First Order Motion Model (FOMM), which encodes dynamic objects by detecting keypoints (a keypoint-warp sketch follows this list). We validate our results across three datasets for video animation and reconstruction using the following metrics: Mean Absolute Error, Joint Embedding Predictive Architecture Embedding Distance, Structural Similarity Index, and Average Pair-wise Displacement.
arXiv Detail & Related papers (2025-04-07T22:21:54Z) - MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation [18.261678529996104]
We propose a framework that can flexibly integrate the output of low-level perception modules with kinematic and scene-modeling priors.
Overall, our method shows robustness to combined noisy input measurements and can process hundreds of points in a few milliseconds.
arXiv Detail & Related papers (2024-08-08T10:55:55Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation (a window-attention sketch follows this list).
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
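Two sketches for the more concrete mechanisms mentioned in the list above. First, for the FOMM-based motion transfer entry: a compact, hypothetical PyTorch sketch of keypoint-driven warping. It blends per-keypoint translations with Gaussian weights, a zeroth-order simplification of FOMM's first-order Taylor expansion with local Jacobians; `keypoint_motion_warp` and `sigma` are illustrative names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def keypoint_motion_warp(source, kp_src, kp_drv, sigma=0.1):
    """Warp the source frame so that its keypoints land on the driving
    keypoints (zeroth-order, translation-only motion model).

    source:         (B, C, H, W) source frame
    kp_src, kp_drv: (B, K, 2) keypoints in normalized [-1, 1] (x, y) coords
    """
    B, C, H, W = source.shape
    # Identity sampling grid over the output (driving-aligned) frame.
    gy, gx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).expand(B, H, W, 2)            # (B, H, W, 2)

    # Gaussian weight of each output location w.r.t. each driving keypoint.
    d2 = ((grid.unsqueeze(3) - kp_drv.view(B, 1, 1, -1, 2)) ** 2).sum(-1)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=-1)                  # (B, H, W, K)

    # Each keypoint contributes a translation (driving -> source),
    # blended by the Gaussian weights; sample the source along it.
    shift = (kp_src - kp_drv).view(B, 1, 1, -1, 2)                     # (B, 1, 1, K, 2)
    sample_grid = grid + (w.unsqueeze(-1) * shift).sum(3)              # (B, H, W, 2)
    return F.grid_sample(source, sample_grid, align_corners=True)
```

Because this is backward warping, an output pixel near a driving keypoint samples the source near the corresponding source keypoint, which is what lets a sender transmit only keypoints instead of full frames.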
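Second, for the local attention idea in the Video Frame Interpolation Transformer entry: a generic non-overlapping window attention sketch. It illustrates the cost argument (each token attends within a fixed window instead of over the whole sequence) but is not VFIT's exact spatial-temporal formulation; shifted windows and the multi-scale synthesis are omitted.

```python
import torch

def local_window_attention(q, k, v, window=8):
    """Non-overlapping window attention: O(N * window) score entries
    instead of the O(N^2) of global self-attention.

    q, k, v: (B, N, D) with N divisible by `window`.
    """
    B, N, D = q.shape
    # Fold the sequence into (B * num_windows, window, D) blocks.
    q = q.reshape(B * N // window, window, D)
    k = k.reshape(B * N // window, window, D)
    v = v.reshape(B * N // window, window, D)
    # Scaled dot-product attention within each window only.
    attn = torch.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)     # (B*n, W, W)
    return (attn @ v).reshape(B, N, D)
```

With N = 4096 tokens and window = 8, each softmax runs over 8 keys instead of 4096, and the total number of attention scores drops from N^2 ≈ 16.8M to N × 8 = 32768, a 512x reduction.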
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.