V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians
- URL: http://arxiv.org/abs/2409.13648v2
- Date: Mon, 23 Sep 2024 08:04:53 GMT
- Title: V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians
- Authors: Penghao Wang, Zhirui Zhang, Liao Wang, Kaixin Yao, Siyuan Xie, Jingyi Yu, Minye Wu, Lan Xu
- Abstract summary: V^3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experiencing high-fidelity volumetric video as seamlessly as 2D videos is a long-held dream. However, current dynamic 3DGS methods, despite their high rendering quality, face challenges in streaming on mobile devices due to computational and bandwidth constraints. In this paper, we introduce V^3 (Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. Additionally, we propose a two-stage training strategy that reduces storage requirements while keeping training fast. The first stage employs hash encoding and a shallow MLP to learn motion, then prunes Gaussians to meet the streaming requirements, while the second stage fine-tunes the other Gaussian attributes using a residual entropy loss and a temporal loss to improve temporal continuity. This strategy, which disentangles motion and appearance, maintains high rendering quality with compact storage. Meanwhile, we design a multi-platform player to decode and render 2D Gaussian videos. Extensive experiments demonstrate the effectiveness of V^3, outperforming other methods by enabling high-quality rendering and streaming on common devices, a capability not previously demonstrated. As the first approach to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience, including smooth scrolling and instant sharing. Our project page with source code is available at https://authoritywang.github.io/v3/.
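To make the key idea concrete, below is a minimal sketch of packing per-frame Gaussian attributes into fixed-size 2D images that an ordinary video encoder can then compress. The attribute layout, quantization scheme, and function names are illustrative assumptions, not the paper's actual format.

```python
# Minimal sketch: serialize per-frame Gaussian attributes into 2D images so a
# standard video codec can compress them. Layout and quantization here are
# illustrative; V^3's actual packing format may differ.
import numpy as np

def pack_frame(positions, scales, colors, side=1024):
    """Pack N Gaussians' attributes into one RGB image per attribute group.

    positions: (N, 3) float32 scene coordinates
    scales:    (N, 3) float32
    colors:    (N, 3) float32 in [0, 1]
    Returns {name: (image, lo, hi)} with (side, side, 3) uint8 images.
    """
    n = positions.shape[0]
    assert n <= side * side, "too many Gaussians for one image"
    images = {}
    for name, attr in (("position", positions), ("scale", scales), ("color", colors)):
        lo, hi = attr.min(axis=0), attr.max(axis=0)
        q = np.zeros((side * side, 3), dtype=np.uint8)
        # Normalize each channel to [0, 255]; (lo, hi) travels as side info so
        # the player can dequantize after hardware video decoding. Gaussians
        # must keep a consistent order across frames so the packed "video" is
        # temporally coherent and codec-friendly.
        q[:n] = np.round(255 * (attr - lo) / np.maximum(hi - lo, 1e-8)).astype(np.uint8)
        images[name] = (q.reshape(side, side, 3), lo, hi)
    return images
```

Provided the ordering stays stable across frames, the packed image sequence behaves like an ordinary 2D video, which is what allows standard hardware codecs (e.g., H.264) on phones to decode the stream efficiently.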
Related papers
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability.
We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency.
SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
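The excerpt above only names the mechanism; as a rough illustration, the sketch below implements block-sparse attention, in which each query block attends only to a small top-scoring subset of key/value blocks. The block-scoring heuristic and parameters are placeholders, not SVG's actual head-wise sparsity patterns.

```python
# Block-sparse attention sketch: skip most key/value blocks per query block.
# Assumes T is divisible by `block` for brevity; the scoring proxy is made up.
import numpy as np

def block_sparse_attention(Q, K, V, block=64, keep_ratio=0.2):
    """Q, K, V: (T, d) token matrices. Attends over top-scoring key blocks."""
    T, d = Q.shape
    nb = T // block
    out = np.zeros_like(V)
    # Cheap per-block proxy: mean-pooled keys (illustrative stand-in).
    pooled = K[:nb * block].reshape(nb, block, d).mean(axis=1)   # (nb, d)
    for qb in range(nb):
        q = Q[qb * block:(qb + 1) * block]                       # (block, d)
        scores = q.mean(axis=0) @ pooled.T                       # (nb,)
        top = np.argsort(scores)[-max(1, int(keep_ratio * nb)):]
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
        s = q @ K[idx].T / np.sqrt(d)
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                        # row softmax
        out[qb * block:(qb + 1) * block] = w @ V[idx]
    return out
```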
- GSVC: Efficient Video Representation and Compression Through 2D Gaussian Splatting
We propose GSVC, an approach to learning a set of 2D Gaussian splats that can effectively represent and compress video frames.
Experiment results show that GSVC achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs.
arXiv Detail & Related papers (2025-01-21T11:30:51Z)
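As a toy illustration of the primitive GSVC fits per frame, the sketch below renders an image from isotropic 2D Gaussians with additive blending; the actual method optimizes splat parameters against each frame (e.g., by gradient descent on reconstruction error) with a more careful formulation.

```python
# Render an image as a sum of isotropic 2D Gaussian splats (simplified:
# additive blending rather than proper alpha compositing).
import numpy as np

def render_splats(means, sigmas, colors, H, W):
    """means: (N, 2) pixel coords (x, y); sigmas: (N,); colors: (N, 3) in [0, 1]."""
    ys, xs = np.mgrid[0:H, 0:W]
    img = np.zeros((H, W, 3), dtype=np.float32)
    for (mx, my), s, c in zip(means, sigmas, colors):
        # Gaussian footprint of this splat over the whole image.
        w = np.exp(-((xs - mx) ** 2 + (ys - my) ** 2) / (2 * s * s))
        img += w[..., None] * c
    return np.clip(img, 0, 1)
```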
- Representing Long Volumetric Video with Temporal Gaussian Hierarchy
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos.
This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z)
- QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos
Online free-viewpoint video (FVV) streaming is a challenging problem, which is relatively under-explored.
We propose a novel framework for QUantized and Efficient ENcoding for streaming FVV using 3D Gaussian Splatting.
We further propose a quantization-sparsity framework, which contains a learned latent decoder for effectively quantizing attribute residuals other than Gaussian positions.
arXiv Detail & Related papers (2024-12-05T18:59:55Z)
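QUEEN's learned latent decoder is beyond a short excerpt, but the general shape of streaming quantized per-frame residuals can be sketched with a plain uniform quantizer as a stand-in:

```python
# Residual quantization sketch: only integer codes cross the wire each frame.
# The uniform quantizer is an illustrative stand-in for QUEEN's learned one.
import numpy as np

def encode_residual(prev_attr, curr_attr, step=1e-3):
    """Quantize the frame-to-frame change of a Gaussian attribute tensor."""
    residual = curr_attr - prev_attr
    return np.round(residual / step).astype(np.int32)

def decode_residual(prev_attr, codes, step=1e-3):
    """Receiver reconstructs the attribute from the previous frame + codes."""
    return prev_attr + codes.astype(np.float32) * step
```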
- HiCoM: Hierarchical Coherent Motion for Streamable Dynamic Scene with 3D Gaussian Splatting
This paper proposes an efficient framework, dubbed HiCoM, with three key components.
First, we construct a compact and robust initial 3DGS representation using a perturbation smoothing strategy.
Next, we introduce a Hierarchical Coherent Motion mechanism that leverages the inherent non-uniform distribution and local consistency of 3D Gaussians.
Experiments conducted on two widely used datasets show that our framework improves the learning efficiency of state-of-the-art methods by about 20%.
arXiv Detail & Related papers (2024-11-12T04:40:27Z)
- Robust Dual Gaussian Splatting for Immersive Human-centric Volumetric Videos
We present a novel approach, dubbed DualGS, for real-time and high-fidelity playback of complex human performance.
Our approach achieves a compression ratio of up to 120 times, only requiring approximately 350KB of storage per frame.
We demonstrate the efficacy of our representation through photo-realistic, free-view experiences on VR headsets.
arXiv Detail & Related papers (2024-09-12T18:33:13Z)
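As a rough consistency check on the storage figure above (our arithmetic, with an assumed 30 fps playback rate, not a figure from the paper): 350 KB per frame at 30 fps is 350 × 30 = 10,500 KB/s, roughly 10.5 MB/s or about 84 Mbit/s, which is why per-frame storage in the hundreds of kilobytes is the relevant budget for headset streaming.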
- SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming with Arbitrary Length
This paper introduces SwinGS, a framework for training, delivering, and rendering volumetric video in a real-time streaming fashion.
We show that SwinGS reduces transmission costs by 83.6% compared to previous work with negligible compromise in PSNR.
We also develop an interactive WebGL viewer enabling real-time volumetric video playback on most devices with modern browsers.
arXiv Detail & Related papers (2024-09-12T05:33:15Z)
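A minimal sketch of the sliding-window idea the title refers to: each incoming frame contributes a chunk of Gaussians, and chunks that fall out of the window are evicted so the client's working set stays bounded regardless of video length. The data layout is an assumption for illustration.

```python
# Sliding-window sketch: the client keeps only Gaussians alive within the
# current window of frames, bounding memory for arbitrarily long videos.
from collections import deque

class GaussianWindow:
    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()  # (frame_id, gaussians_chunk) pairs

    def push(self, frame_id, gaussians_chunk):
        """Admit the chunk streamed for `frame_id`, evicting expired chunks."""
        self.window.append((frame_id, gaussians_chunk))
        while self.window and self.window[0][0] <= frame_id - self.window_size:
            self.window.popleft()

    def active_gaussians(self):
        """Chunks to splat for the current frame."""
        return [chunk for _, chunk in self.window]
```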
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
- VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams
VideoRF is the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms.
We show that the feature image stream can be efficiently compressed by 2D video codecs.
We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes.
arXiv Detail & Related papers (2023-12-03T14:14:35Z)
- Scalable Neural Video Representations with Learnable Positional Features
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57.
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
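A minimal sketch in the spirit of NVP's learnable positional features: a small learnable (x, y, t) feature grid decoded by a shallow MLP, trained by regressing sampled pixel colors. Grid sizes, layer widths, and names are illustrative assumptions, not NVP's actual architecture.

```python
# Latent-grid video sketch: a learnable feature volume indexed by (x, y, t),
# decoded to RGB by a small MLP. Fit by minimizing MSE to sampled pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVideo(nn.Module):
    def __init__(self, gx=64, gy=64, gt=32, feat=16):
        super().__init__()
        # Learnable 3D feature volume standing in for positional features.
        self.grid = nn.Parameter(torch.randn(1, feat, gt, gy, gx) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 3), nn.Sigmoid()
        )

    def forward(self, coords):
        """coords: (N, 3) in [-1, 1], ordered (x, y, t); returns (N, 3) RGB."""
        # grid_sample's last grid dim is (x, y, z); time plays the role of z.
        g = F.grid_sample(
            self.grid, coords.view(1, -1, 1, 1, 3), align_corners=True
        )  # -> (1, feat, N, 1, 1)
        return self.mlp(g.view(self.grid.shape[1], -1).t())
```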