VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- URL: http://arxiv.org/abs/2506.18903v3
- Date: Thu, 14 Aug 2025 14:03:30 GMT
- Title: VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- Authors: Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
- Abstract summary: We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones.
- Score: 55.73900731190389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel memory module for building video generators capable of interactively exploring environments. Previous approaches have achieved similar results either by out-painting 2D views of a scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by using video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost required to use all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
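To make the indexing scheme concrete, below is a minimal sketch of how a surfel-indexed view memory could be organized, assuming surfels are approximated by voxel-quantized 3D surface points and visibility by a pinhole frustum test; the class and method names are illustrative, not VMem's actual API.

```python
# A hypothetical surfel-indexed view memory: each quantized surface element
# remembers which past views observed it, and retrieval votes for the past
# views whose surfels are visible from the new target camera.
import numpy as np
from collections import defaultdict

class SurfelViewMemory:
    def __init__(self, voxel_size=0.1):
        self.voxel_size = voxel_size
        self.index = defaultdict(set)  # quantized surfel -> ids of observing views

    def _key(self, p):
        return tuple(np.floor(p / self.voxel_size).astype(int))

    def insert(self, view_id, points_world):
        # Index a generated view by the 3D surface points it observed,
        # e.g. unprojected from its estimated depth map.
        for p in points_world:
            self.index[self._key(p)].add(view_id)

    def retrieve(self, K, w2c, image_size, k=4):
        # Vote for past views whose surfels land inside the target frustum.
        w, h = image_size
        votes = defaultdict(int)
        for key, view_ids in self.index.items():
            p = (np.array(key) + 0.5) * self.voxel_size   # voxel center
            pc = w2c[:3, :3] @ p + w2c[:3, 3]             # world -> camera
            if pc[2] <= 0:                                # behind the camera
                continue
            u, v = (K @ pc)[:2] / pc[2]                   # project to pixels
            if 0 <= u < w and 0 <= v < h:
                for vid in view_ids:
                    votes[vid] += 1
        # The k most-voted past views become the generator's context.
        return sorted(votes, key=votes.get, reverse=True)[:k]
```

Retrieval cost then scales with the number of indexed surfels rather than with the full view history, which is what allows generation to condition on a small, relevant subset of past views.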
Related papers
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context [33.99324999592141]
Scene-consistent video generation aims to create videos that explore 3D scenes along a given camera trajectory.
Previous methods rely on video generation models with external memory for consistency.
We introduce "geometry-as-context" to overcome these limitations.
arXiv Detail & Related papers (2026-02-25T14:09:03Z) - AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories [78.78355829813793]
Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history.
We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories.
Experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality.
arXiv Detail & Related papers (2026-02-16T17:23:08Z) - RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control.
We present RELIC, a unified framework that tackles these three challenges altogether.
Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z) - 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation [55.29423122177883]
3DScenePrompt is a framework that generates the next video chunk from input of arbitrary length.
It enables camera control while preserving scene consistency.
Our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
arXiv Detail & Related papers (2025-10-16T17:55:25Z) - Video World Models with Long-term Spatial Memory [110.530715838396]
We introduce a novel framework to enhance the long-term consistency of video world models.
Our framework includes mechanisms to store and retrieve information from the long-term spatial memory.
Our evaluations show improved quality, consistency, and context length compared to relevant baselines.
arXiv Detail & Related papers (2025-06-05T17:42:34Z) - Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval [33.15952106579093]
We propose Context-as-Memory, which utilizes historical context as memory for video generation.
Considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module.
Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to state-of-the-art methods.
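As a rough illustration of what such a retrieval module might compute, the sketch below scores past frames by viewing-direction agreement and camera proximity to the target pose; the scoring rule and weights are assumptions for exposition, not the paper's exact criterion.

```python
# Hypothetical context-frame retrieval: rank stored frames by how much their
# cameras point at and sit near the target camera, then keep the top k.
import numpy as np

def retrieve_context(past_poses, target_pose, k=8, w_dir=1.0, w_pos=0.5):
    # Poses are 4x4 camera-to-world matrices; column 2 is taken as the
    # viewing axis (an assumed convention).
    t_dir, t_pos = target_pose[:3, 2], target_pose[:3, 3]
    scores = []
    for i, P in enumerate(past_poses):
        cos = float(P[:3, 2] @ t_dir)                   # direction agreement
        dist = float(np.linalg.norm(P[:3, 3] - t_pos))  # camera distance
        scores.append((w_dir * cos - w_pos * dist, i))
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```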
arXiv Detail & Related papers (2025-06-03T17:59:05Z) - WorldExplorer: Towards Generating Fully Navigable 3D Scenes [49.21733308718443]
WorldExplorer builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints.
We generate multiple videos along short, pre-defined trajectories that explore the scene in depth.
Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results.
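The collision-detection mechanism could be as simple as the following depth-based check, where `render_depth` is an assumed helper that renders a depth map for a candidate camera pose; the central-crop heuristic and clearance threshold are illustrative.

```python
# Hypothetical collision test: reject a camera step when the scene surface
# gets too close in the central part of the rendered view.
import numpy as np

def collides(pose, render_depth, min_clearance=0.3):
    depth = render_depth(pose)                  # (H, W) metric depth map
    h, w = depth.shape
    center = depth[h//3:2*h//3, w//3:2*w//3]    # ignore the image borders
    return float(np.nanmin(center)) < min_clearance
```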
arXiv Detail & Related papers (2025-06-02T15:41:31Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.
This is challenging, as it involves deep vision-language understanding, pixel-level dense prediction, and temporal reasoning.
We propose ReferDINO, an RVOS model that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z) - Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - SceneScape: Text-Driven Consistent Scene Generation [14.348512536556413]
We introduce a novel framework that generates such videos in an online fashion by combining a pre-trained text-to-image model with a pre-trained monocular depth prediction model.
To tackle the pivotal challenge of achieving 3D consistency, we apply online test-time training that encourages the predicted depth map of the current frame to be geometrically consistent with the synthesized scene.
In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.
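A minimal sketch of such a test-time depth-consistency step, assuming a differentiable depth predictor and a depth map rendered from the scene geometry synthesized so far (`depth_net`, `rendered_depth`, and `valid_mask` are illustrative names):

```python
# One hypothetical test-time training step: nudge the depth predictor so the
# current frame's predicted depth agrees with the already-synthesized scene.
import torch

def depth_consistency_step(depth_net, optimizer, frame, rendered_depth, valid_mask):
    pred = depth_net(frame)                      # predicted depth for the frame
    loss = torch.abs(pred - rendered_depth)[valid_mask].mean()  # L1 on valid pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```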
arXiv Detail & Related papers (2023-02-02T14:47:19Z) - Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) an efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference, and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field.
Our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining higher visual accuracy for generated novel views.
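The division of labor between the two modules can be sketched as follows: a small deformation network bends observed points into a canonical frame, where a static (e.g., hash-encoded) radiance field is queried; all module names and sizes here are illustrative, not the paper's architecture.

```python
# Hypothetical two-module design: time only enters through the deformation
# network; the canonical radiance field itself is static.
import torch
import torch.nn as nn

class DeformableField(nn.Module):
    def __init__(self, canonical_field, hidden=64):
        super().__init__()
        self.canonical = canonical_field     # static field over the canonical scene
        self.deform = nn.Sequential(         # maps (x, t) -> offset into canonical space
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, x, t):
        # x: (N, 3) sample points, t: (N, 1) timestamps.
        offset = self.deform(torch.cat([x, t], dim=-1))
        return self.canonical(x + offset)    # query the static canonical field
```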
arXiv Detail & Related papers (2022-12-02T18:51:10Z) - Virtual Correspondence: Humans as a Cue for Extreme-View Geometry [104.09449367670318]
We present a novel concept called virtual correspondences (VCs).
VCs conform with epipolar geometry but, unlike classic correspondences, do not need to be co-visible across views.
We show how VCs can be seamlessly integrated with classic bundle adjustment to recover camera poses across extreme views.
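Because VCs respect epipolar geometry, they can in principle feed a standard two-view pose pipeline even when no surface point is co-visible; the sketch below uses OpenCV's essential-matrix estimation, with `pts1`/`pts2` standing in for virtual-correspondence endpoints in the two images.

```python
# Recover relative pose from VC-style matches via the epipolar constraint.
import cv2

def pose_from_virtual_correspondences(pts1, pts2, K):
    # pts1, pts2: (N, 2) float32 pixel coordinates; K: 3x3 intrinsics.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # rotation and unit-scale translation
```

Full reconstruction across extreme views would then refine these initial poses with bundle adjustment, as the abstract describes.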
arXiv Detail & Related papers (2022-06-16T17:59:42Z)