VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- URL: http://arxiv.org/abs/2506.18903v2
- Date: Thu, 10 Jul 2025 14:56:24 GMT
- Title: VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- Authors: Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
- Abstract summary: We introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
- Score: 55.73900731190389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
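The abstract describes two operations: indexing each past view by the surfels it has observed, and retrieving the most relevant past views for a new target camera. The snippet below is a minimal, hypothetical Python sketch of such a surfel-indexed memory, not the authors' implementation; the names (`SurfelViewMemory`, `quantize_surfel`, `insert`, `retrieve`) and the voxel-plus-normal quantization scheme are illustrative assumptions.

```python
# Hypothetical sketch of a surfel-indexed view memory (not the VMem code):
# past views are keyed by the quantized surface elements (surfels) they
# observed, and candidate views for a new camera are retrieved by counting
# how many of the surfels expected to be visible each past view has seen.
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

import numpy as np

SurfelKey = Tuple[int, int, int, int, int, int]  # quantized position + normal


def quantize_surfel(position: np.ndarray, normal: np.ndarray,
                    voxel_size: float = 0.05) -> SurfelKey:
    """Map a 3D point and its normal to a discrete surfel key."""
    p = tuple(np.floor(position / voxel_size).astype(int))
    n = tuple(np.round(normal * 2).astype(int))  # coarse normal bins
    return p + n


class SurfelViewMemory:
    """Index past views by observed surfels; retrieve views by surfel overlap."""

    def __init__(self) -> None:
        self._surfel_to_views: Dict[SurfelKey, Set[int]] = defaultdict(set)

    def insert(self, view_id: int, observed_surfels: Iterable[SurfelKey]) -> None:
        """Record that `view_id` observed the given surfels."""
        for key in observed_surfels:
            self._surfel_to_views[key].add(view_id)

    def retrieve(self, query_surfels: Iterable[SurfelKey], k: int = 4) -> List[int]:
        """Return the k past views that share the most surfels with the query."""
        votes: Counter = Counter()
        for key in query_surfels:
            for view_id in self._surfel_to_views.get(key, ()):
                votes[view_id] += 1
        return [view_id for view_id, _ in votes.most_common(k)]
```

In this reading, generating the next frame would amount to estimating which surfels the target camera is expected to see (e.g., by projecting the current surfel set into that camera), calling `retrieve` to select a small set of past views, conditioning the video generator only on those views, and finally calling `insert` with the surfels observed by the newly generated view.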
Related papers
- Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance long-term consistency of video world models. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory. Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z)
- WorldExplorer: Towards Generating Fully Navigable 3D Scenes [49.21733308718443]
WorldExplorer builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We generate multiple videos along short, pre-defined trajectories that explore the scene in depth. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results.
arXiv Detail & Related papers (2025-06-02T15:41:31Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction, and temporal reasoning. We propose ReferDINO, an RVOS approach that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- SceneScape: Text-Driven Consistent Scene Generation [14.348512536556413]
We introduce a novel framework that generates long videos of a scene in an online fashion by combining a pre-trained text-to-image model with a pre-trained monocular depth prediction model.
To tackle the pivotal challenge of achieving 3D consistency, we deploy online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene.
In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.
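The SceneScape summary above mentions an online test-time training step that makes the predicted depth of the current frame agree with the geometry of the scene synthesized so far. Below is a hypothetical PyTorch-style sketch of that idea, not the SceneScape authors' code; `depth_model`, the rendered depth, the validity mask, and the hyperparameters are placeholder assumptions.

```python
# Hypothetical sketch of per-frame test-time depth fine-tuning (illustrative only):
# the monocular depth predictor is briefly optimized so that its prediction for
# the current frame matches the depth rendered from the already-synthesized scene.
import torch
import torch.nn.functional as F


def test_time_depth_finetune(depth_model, frame, rendered_depth, valid_mask,
                             steps: int = 20, lr: float = 1e-5):
    """Fine-tune `depth_model` on one frame to match the scene's rendered depth.

    frame:          (1, 3, H, W) image tensor for the current view
    rendered_depth: (1, 1, H, W) depth rendered from the synthesized scene
    valid_mask:     (1, 1, H, W) mask of pixels already covered by the scene
    """
    optimizer = torch.optim.Adam(depth_model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        predicted_depth = depth_model(frame)  # (1, 1, H, W)
        # Penalize disagreement only where the existing scene is visible;
        # newly revealed regions are free to take any plausible depth.
        loss = F.l1_loss(predicted_depth * valid_mask, rendered_depth * valid_mask)
        loss.backward()
        optimizer.step()
    return depth_model(frame).detach()
```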
arXiv Detail & Related papers (2023-02-02T14:47:19Z)
- Virtual Correspondence: Humans as a Cue for Extreme-View Geometry [104.09449367670318]
We present a novel concept called virtual correspondences (VCs).
VCs conform with epipolar geometry; unlike classic correspondences, VCs do not need to be co-visible across views.
We show how VCs can be seamlessly integrated with classic bundle adjustment to recover camera poses across extreme views.
arXiv Detail & Related papers (2022-06-16T17:59:42Z)