Representing Long Volumetric Video with Temporal Gaussian Hierarchy
- URL: http://arxiv.org/abs/2412.09608v1
- Date: Thu, 12 Dec 2024 18:59:34 GMT
- Title: Representing Long Volumetric Video with Temporal Gaussian Hierarchy
- Authors: Zhen Xu, Yinghao Xu, Zhiyuan Yu, Sida Peng, Jiaming Sun, Hujun Bao, Xiaowei Zhou
- Abstract summary: This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos.
This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
- Score: 80.51373034419379
- Abstract: This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. Recent dynamic view synthesis methods leverage powerful 4D representations, like feature grids or point cloud sequences, to achieve high-quality rendering results. However, they are typically limited to short (1~2s) video clips and often suffer from large memory footprints when dealing with longer videos. To solve this issue, we propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. Our key observation is that there are generally various degrees of temporal redundancy in dynamic scenes, which consist of areas changing at different speeds. Motivated by this, our approach builds a multi-level hierarchy of 4D Gaussian primitives, where each level separately describes scene regions with different degrees of content change, and adaptively shares Gaussian primitives to represent unchanged scene content over different temporal segments, thus effectively reducing the number of Gaussian primitives. In addition, the tree-like structure of the Gaussian hierarchy allows us to efficiently represent the scene at a particular moment with a subset of Gaussian primitives, leading to nearly constant GPU memory usage during the training or rendering regardless of the video length. Extensive experimental results demonstrate the superiority of our method over alternative methods in terms of training cost, rendering speed, and storage usage. To our knowledge, this work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality. Our project page is available at: https://zju3dv.github.io/longvolcap.
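The hierarchy described in the abstract can be pictured as a temporal tree: slowly changing content is stored at coarse levels whose segments span long time ranges, rapidly changing content lives at fine levels with short segments, and rendering a given timestamp only touches one segment per level. Below is a minimal Python sketch of that bookkeeping, assuming non-overlapping segments whose length halves at each finer level; it is an illustration of the idea, not the authors' implementation, and all names are made up.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float        # segment start time (seconds)
    end: float          # segment end time (seconds)
    gaussian_ids: list  # indices of 4D Gaussian primitives owned by this segment


@dataclass
class TemporalGaussianHierarchy:
    levels: list  # levels[k] is a list of Segments; coarser levels span longer times

    def active_gaussians(self, t: float) -> list:
        """Gather the Gaussians needed to render timestamp t: one segment per level."""
        active = []
        for segments in self.levels:
            for seg in segments:
                if seg.start <= t < seg.end:
                    active.extend(seg.gaussian_ids)
                    break  # segments within a level are assumed non-overlapping
        return active


def build_hierarchy(duration: float, num_levels: int) -> TemporalGaussianHierarchy:
    """Empty hierarchy whose segment length halves at each finer level."""
    levels = []
    for k in range(num_levels):
        seg_len = duration / (2 ** k)
        levels.append([Segment(i * seg_len, (i + 1) * seg_len, [])
                       for i in range(2 ** k)])
    return TemporalGaussianHierarchy(levels)
```

With a structure like this, the set of primitives fetched for any timestamp is bounded by the per-segment budgets rather than by the video length, which matches the abstract's claim of nearly constant GPU memory during training and rendering.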
Related papers
- GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting [28.981174430968643]
We introduce a novel neural representation that combines 3D Gaussian splatting with continuous camera motion modeling.
Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency.
This memory-efficient approach achieves high-quality rendering at impressive speeds.
arXiv Detail & Related papers (2025-01-08T19:01:12Z)
- 4D Gaussian Splatting with Scale-aware Residual Field and Adaptive Optimization for Real-time Rendering of Temporally Complex Dynamic Scenes [19.24815625343669]
SaRO-GS is a novel dynamic scene representation capable of achieving real-time rendering.
To handle temporally complex dynamic scenes, we introduce a Scale-aware Residual Field.
Our method has demonstrated state-of-the-art performance.
arXiv Detail & Related papers (2024-12-09T08:44:19Z)
- Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework that draws inspiration from professional video editing practices.
The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels (a minimal compositing sketch follows this entry).
We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z)
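A decomposition into RGB layers with opacities is typically mapped back to frames by standard back-to-front "over" compositing. The snippet below is a generic compositing sketch under that assumption (NumPy arrays, back layer first), not VDP's exact formulation:

```python
import numpy as np


def composite_layers(rgb_layers: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Back-to-front "over" compositing of decomposed video layers.

    rgb_layers: (L, H, W, 3) per-layer colors in [0, 1], index 0 = back layer.
    alphas:     (L, H, W, 1) per-layer opacities in [0, 1].
    Returns the composited (H, W, 3) frame.
    """
    frame = np.zeros(rgb_layers.shape[1:], dtype=np.float32)
    for rgb, a in zip(rgb_layers, alphas):
        frame = a * rgb + (1.0 - a) * frame  # standard "over" operator
    return frame
```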
- Compact 3D Gaussian Splatting for Static and Dynamic Radiance Fields [13.729716867839509]
We propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance (a sketch of one possible masking scheme follows this entry).
In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field.
Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering.
arXiv Detail & Related papers (2024-08-07T14:56:34Z)
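One common way to make such pruning learnable is to keep a real-valued score per Gaussian, binarize it with a straight-through estimator, and add a sparsity penalty so primitives that do not help reconstruction are dropped. The PyTorch sketch below illustrates that general pattern; the class, its parameters, and the 0.5 threshold are assumptions for illustration rather than the paper's exact strategy.

```python
import torch
import torch.nn as nn


class LearnableGaussianMask(nn.Module):
    """Per-Gaussian binary keep/drop mask trained with a straight-through estimator."""

    def __init__(self, num_gaussians: int, init: float = 2.0):
        super().__init__()
        # Positive init -> sigmoid(score) starts near 1, i.e. keep every Gaussian.
        self.scores = nn.Parameter(torch.full((num_gaussians,), init))

    def forward(self, opacities: torch.Tensor) -> torch.Tensor:
        # opacities: (N, 1) per-Gaussian opacities.
        soft = torch.sigmoid(self.scores)            # differentiable surrogate
        hard = (soft > 0.5).float()                  # binary keep/drop decision
        mask = soft + (hard - soft).detach()         # forward: hard, backward: soft
        return opacities * mask.unsqueeze(-1)        # masked-out Gaussians vanish

    def sparsity_loss(self) -> torch.Tensor:
        # Penalty to minimize: pushes scores of unhelpful Gaussians toward zero.
        return torch.sigmoid(self.scores).mean()
```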
- A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets [45.13531064740826]
We introduce a hierarchy of 3D Gaussians that preserves visual quality for very large scenes.
We offer a Level-of-Detail (LOD) solution for efficient rendering of distant content.
We show results for captured scenes with up to tens of thousands of images with a simple and affordable rig.
arXiv Detail & Related papers (2024-06-17T20:40:18Z)
- VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction [59.40711222096875]
We present VastGaussian, the first method for high-quality reconstruction and real-time rendering on large scenes based on 3D Gaussian Splatting.
Our approach outperforms existing NeRF-based methods and achieves state-of-the-art results on multiple large scene datasets.
arXiv Detail & Related papers (2024-02-27T11:40:50Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) an efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field (a query-path sketch follows this entry).
In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views.
arXiv Detail & Related papers (2022-12-02T18:51:10Z)
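The two-module design above follows the familiar deformation-field pattern: a time-conditioned network displaces each sample point into a canonical frame, where a static (e.g. hash-encoded) radiance field is evaluated. A minimal sketch of that query path, with all module names and shapes assumed for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn


class DeformableRadianceField(nn.Module):
    """Generic deformation-field pattern: bend the query point into a canonical
    scene, then evaluate a static radiance field there."""

    def __init__(self, deform_net: nn.Module, canonical_field: nn.Module):
        super().__init__()
        self.deform_net = deform_net            # time-dependent: maps (x, t) -> offset
        self.canonical_field = canonical_field  # static field, e.g. hash-grid + small MLP

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) sample positions, t: (N, 1) normalized timestamps.
        offset = self.deform_net(torch.cat([x, t], dim=-1))  # (N, 3) displacement
        x_canonical = x + offset                              # point in canonical space
        return self.canonical_field(x_canonical)              # e.g. (N, 4) RGB + density
```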
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence (a minimal two-stage sketch follows this entry).
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
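The two-stage recipe above, first learning a frame autoencoder and then a predictor that rolls latents forward in time, keeps memory low because the predictor never touches pixels. A minimal PyTorch sketch of that structure, with every module, shape, and hyperparameter assumed for illustration:

```python
import torch
import torch.nn as nn


class FrameAutoencoder(nn.Module):
    """Stage 1: compress each frame into a small latent vector (assumes 3x64x64 frames)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, frame: torch.Tensor):
        z = self.encoder(frame)                 # (B, latent_dim)
        recon = self.decoder(z).view_as(frame)  # reconstruction for stage-1 training
        return z, recon


class LatentPredictor(nn.Module):
    """Stage 2: predict future latents from past latents; pixels never enter."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, past_latents: torch.Tensor, horizon: int) -> torch.Tensor:
        # past_latents: (B, T, latent_dim); roll forward `horizon` steps autoregressively.
        _, h = self.rnn(past_latents)
        z, preds = past_latents[:, -1:], []
        for _ in range(horizon):
            z, h = self.rnn(z, h)
            preds.append(z)
        return torch.cat(preds, dim=1)  # (B, horizon, latent_dim)
```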