SplatVoxel: History-Aware Novel View Streaming without Temporal Training
- URL: http://arxiv.org/abs/2503.14698v1
- Date: Tue, 18 Mar 2025 20:00:47 GMT
- Title: SplatVoxel: History-Aware Novel View Streaming without Temporal Training
- Authors: Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun
- Abstract summary: We study the problem of novel view streaming from sparse-view videos. Existing novel view synthesis methods struggle with temporal coherence and visual fidelity. We propose a hybrid splat-voxel feed-forward scene reconstruction approach.
- Score: 29.759664150610362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: https://19reborn.github.io/SplatVoxel/
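The abstract's propagation step, warping Gaussian primitives over time with a motion graph, can be illustrated with a minimal numpy sketch. Everything here is an assumption: the neighbour count k, the Gaussian weighting, and the skinning-style blend are illustrative stand-ins rather than the paper's actual formulation; each Gaussian center is simply translated by a distance-weighted blend of the 3D motion of its nearest graph nodes.

```python
import numpy as np

def warp_gaussians(centers, node_pos, node_motion, k=4, sigma=0.1):
    """Translate Gaussian centers by blending the 3D motion of the k nearest
    motion-graph nodes (skinning-style interpolation)."""
    # Distances between every Gaussian (N, 3) and every graph node (M, 3).
    d = np.linalg.norm(centers[:, None, :] - node_pos[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                    # k nearest nodes per Gaussian
    w = np.exp(-np.take_along_axis(d, idx, axis=1) ** 2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                     # normalized blend weights
    return centers + (w[..., None] * node_motion[idx]).sum(axis=1)

centers = np.random.rand(1000, 3)        # Gaussian means at frame t
nodes = np.random.rand(64, 3)            # motion-graph node positions
flows = 0.01 * np.random.randn(64, 3)    # per-node 3D motion (e.g., lifted 2D tracks)
warped = warp_gaussians(centers, nodes, flows)
```

The appeal of this kind of interpolation is that only the sparse graph nodes need motion estimates (per the abstract, lifted from 2D tracking models to 3D), which keeps per-frame updates cheap.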
Related papers
- FlowR: Flowing from Sparse to Dense 3D Reconstructions [60.6368083163258]
We propose a flow matching model that learns a flow to connect novel view renderings to renderings that we expect from dense reconstructions.
Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass.
arXiv Detail & Related papers (2025-04-02T11:57:01Z)
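FlowR's premise, learning a flow that carries sparse-reconstruction renderings toward dense-reconstruction quality, follows the general conditional flow-matching recipe. Below is a hedged sketch of one training step; the loss is generic flow matching and the stand-in network is purely illustrative, not FlowR's architecture.

```python
import torch

def flow_matching_loss(model, x0, x1):
    """One conditional flow-matching step: regress the velocity that carries a
    sparse-reconstruction render x0 toward the dense-quality target x1."""
    t = torch.rand(x0.shape[0], 1, 1, 1)     # random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * x1               # point on the linear interpolation path
    v_target = x1 - x0                       # constant velocity of that path
    return ((model(xt, t) - v_target) ** 2).mean()

# Stand-in network; the real model is a large multi-view architecture.
net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
loss = flow_matching_loss(lambda x, t: net(x),
                          torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```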
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts.
Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion.
We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
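The idea of attending along point tracks can be sketched as follows. This is a simplification under assumed shapes: features are bilinearly sampled at each track's location per frame, then plain self-attention runs over every track's time axis; it omits pieces a full layer would need, such as learned projections and writing the refined features back into the frame maps.

```python
import torch
import torch.nn.functional as F

def track_attention(feats, tracks):
    """feats: (T, C, H, W) per-frame features; tracks: (N, T, 2) xy in [-1, 1].
    Samples features along each track and self-attends over its time axis."""
    T, C, H, W = feats.shape
    N = tracks.shape[0]
    grid = tracks.permute(1, 0, 2).reshape(T, N, 1, 2)        # one sample point per track
    tokens = F.grid_sample(feats, grid, align_corners=True)   # (T, C, N, 1)
    tokens = tokens.squeeze(-1).permute(2, 0, 1)              # (N, T, C) track sequences
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ tokens                                      # temporally mixed track features

out = track_attention(torch.randn(8, 32, 64, 64), torch.rand(100, 8, 2) * 2 - 1)
```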
- SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models [10.66567645920237]
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the garment while maintaining temporal consistency. We reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence.
arXiv Detail & Related papers (2024-12-13T14:50:26Z)
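Temporal attention layers like those in the summary above are commonly implemented with a standard reshaping trick for video extensions of image diffusion models: spatial positions fold into the batch dimension so attention runs only over frames. A minimal sketch, not SwiftTry's exact module:

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def temporal_attention(x):
    """x: (B, T, C, H, W) video features. Each spatial position becomes its own
    sequence over the frame axis, so attention mixes information across time."""
    B, T, C, H, W = x.shape
    tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
    out, _ = attn(tokens, tokens, tokens)      # self-attention over frames only
    return out.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

y = temporal_attention(torch.randn(1, 8, 64, 16, 16))
```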
- D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as smartphone captures. Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
arXiv Detail & Related papers (2024-06-14T14:35:44Z)
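A toy sketch of the hash-encoded feature-grid lookup such representations build on; the primes, table size, and single-corner query are illustrative assumptions (real multiresolution hash grids interpolate across corners and levels):

```python
import numpy as np

# Spatial-hash lookup in the spirit of multiresolution hash grids; the primes,
# table size, and nearest-corner shortcut are toy assumptions.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_lookup(table, pts, res):
    """table: (T, F) learned features; pts: (N, 3) in [0, 1); nearest-corner query."""
    idx = np.floor(pts * res).astype(np.uint64)            # voxel corner indices
    h = np.bitwise_xor.reduce(idx * PRIMES, axis=1)        # spatial hash
    return table[(h % np.uint64(table.shape[0])).astype(np.int64)]

table = np.random.randn(2 ** 14, 4).astype(np.float32)  # hash table of 4-dim features
feats = hash_lookup(table, np.random.rand(100, 3), res=128)
```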
- Fast View Synthesis of Casual Videos with Soup-of-Planes [24.35962788109883]
Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax.
This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently.
Our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.
arXiv Detail & Related papers (2023-12-04T18:55:48Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) An efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field.
In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views.
arXiv Detail & Related papers (2022-12-02T18:51:10Z)
- Progressive Temporal Feature Alignment Network for Video Inpainting [51.26380898255555]
Video inpainting aims to fill spatio-temporal "corrupted regions" with plausible content.
Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution.
We propose the 'Progressive Temporal Feature Alignment Network', which progressively enriches features extracted from the current frame with warped features from neighbouring frames.
arXiv Detail & Related papers (2021-04-08T04:50:33Z)
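The core operation such an alignment network progressively applies, warping neighbour-frame features into the current frame, can be sketched with standard backward warping by optical flow; the names and shapes here are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """feat: (B, C, H, W) neighbour-frame features; flow: (B, 2, H, W) pixel
    offsets. Backward-warps the neighbour features into the current frame."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().unsqueeze(0).expand(B, -1, -1, -1)
    coords = base + flow                                   # sampling locations in pixels
    coords = torch.stack((2 * coords[:, 0] / (W - 1) - 1,  # normalize to [-1, 1]
                          2 * coords[:, 1] / (H - 1) - 1), dim=-1)
    return F.grid_sample(feat, coords, align_corners=True)

aligned = warp_features(torch.randn(1, 32, 64, 64), torch.zeros(1, 2, 64, 64))
```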