D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
- URL: http://arxiv.org/abs/2503.05600v1
- Date: Fri, 07 Mar 2025 17:26:27 GMT
- Title: D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS
- Authors: Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, Yiling Xu,
- Abstract summary: Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting and denoising, underscoring its potential as a promising solution for video representation.
- Score: 22.373386953378002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. However, their implicit formulation limits both interpretability and efficacy, undermining their practicality as a comprehensive solution. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV, which aims to achieve three key objectives: 1) improved efficiency while delivering superior quality; 2) enhanced scalability and interpretability; and 3) increased friendliness for downstream tasks. Specifically, we initially divide the video sequence into fixed-length Groups of Pictures (GoP) to allow parallel training and linear scalability with video length. For each GoP, D2GV represents video frames by applying differentiable rasterization to 2D Gaussians, which are deformed from a canonical space into their corresponding timestamps. Notably, leveraging efficient CUDA-based rasterization, D2GV converges fast and decodes at speeds exceeding 400 FPS, while delivering quality that matches or surpasses state-of-the-art INRs. Moreover, we incorporate a learnable pruning and quantization strategy to streamline D2GV into a more compact representation. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting and denoising, underscoring its potential as a promising solution for video representation. Code is available at: \href{https://github.com/Evan-sudo/D2GV}{https://github.com/Evan-sudo/D2GV}.
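A minimal sketch of the pipeline described in the abstract, under assumptions: a shared set of canonical 2D Gaussians is deformed to a given timestamp, splatted into a frame, and optimized against the ground-truth frames of a GoP. The class name, the tiny MLP deformation field, the dense normalized blending, and all sizes are illustrative choices, not the authors' implementation (D2GV relies on an efficient CUDA-based rasterizer).

```python
# Toy sketch of deformable 2D Gaussian video fitting (assumed design, not the
# official D2GV code; the paper uses tiled CUDA rasterization, not this dense blend).
import torch
import torch.nn as nn

class DeformableGaussians2D(nn.Module):
    def __init__(self, num_gaussians=1024, hidden=64):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(num_gaussians, 2))                 # canonical centers in [0, 1]^2
        self.log_scale = nn.Parameter(torch.full((num_gaussians, 2), -3.0))  # per-axis log std
        self.color = nn.Parameter(torch.rand(num_gaussians, 3))              # RGB per Gaussian
        self.opacity = nn.Parameter(torch.zeros(num_gaussians))              # pre-sigmoid opacity
        # Tiny deformation field: (canonical center, timestamp) -> center offset.
        self.deform = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, t, height, width):
        n = self.mu.shape[0]
        t_col = torch.full((n, 1), float(t), device=self.mu.device)
        mu_t = self.mu + self.deform(torch.cat([self.mu, t_col], dim=-1))    # centers deformed to time t
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, height, device=self.mu.device),
            torch.linspace(0, 1, width, device=self.mu.device), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1)                                 # (H, W, 2) pixel coords (x, y)
        diff = grid.unsqueeze(0) - mu_t[:, None, None, :]                    # (N, H, W, 2)
        scale = torch.exp(self.log_scale)[:, None, None, :]
        weight = torch.exp(-0.5 * ((diff / scale) ** 2).sum(dim=-1))         # axis-aligned Gaussian falloff
        alpha = torch.sigmoid(self.opacity)[:, None, None] * weight          # (N, H, W)
        rgb = (alpha.unsqueeze(-1) * self.color[:, None, None, :]).sum(dim=0)
        return rgb / (alpha.sum(dim=0).unsqueeze(-1) + 1e-8)                 # (H, W, 3)

# Fit one timestamp of a GoP; a random image stands in for a real frame.
model = DeformableGaussians2D()
target = torch.rand(64, 64, 3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    loss = ((model(t=0.0, height=64, width=64) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The learnable pruning and quantization mentioned in the abstract would further shrink these per-GoP parameters; that stage is omitted in this sketch.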
Related papers
- GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting [10.568851068989973]
Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression. We propose a new video representation and compression method based on 2D Gaussian Splatting to efficiently handle video data. Our method reduces memory usage by up to 78.4%, and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding.
arXiv Detail & Related papers (2025-03-06T11:31:08Z)
- GSVC: Efficient Video Representation and Compression Through 2D Gaussian Splatting [3.479384894190067]
We propose GSVC, an approach to learning a set of 2D Gaussian splats that can effectively represent and compress video frames. Experimental results show that GSVC achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs.
arXiv Detail & Related papers (2025-01-21T11:30:51Z)
- VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a video autoencoder that decouples video into two distinct latent spaces. Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements. Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
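As a rough illustration of this decoupling (a toy example of my own, not VidTwin's architecture; every module name and size below is an assumption), one can picture an autoencoder whose single encoder emits two codes, a larger "structure" latent and a much smaller "dynamics" latent, concatenated for reconstruction:

```python
# Toy two-latent video autoencoder (illustrative assumption, not VidTwin).
import torch
import torch.nn as nn

class TwoLatentVideoAE(nn.Module):
    def __init__(self, t=8, h=32, w=32, d_struct=64, d_dyn=8):
        super().__init__()
        self.out_shape = (3, t, h, w)
        feat = 32 * 4 * 8 * 8
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 8, 8)), nn.Flatten())
        self.to_struct = nn.Linear(feat, d_struct)  # coarse content / global motion
        self.to_dyn = nn.Linear(feat, d_dyn)        # small code for fine, rapid dynamics
        self.decoder = nn.Linear(d_struct + d_dyn, 3 * t * h * w)

    def forward(self, video):                       # video: (B, 3, T, H, W)
        f = self.encoder(video)
        z_struct, z_dyn = self.to_struct(f), self.to_dyn(f)
        recon = self.decoder(torch.cat([z_struct, z_dyn], dim=-1))
        return recon.view(video.shape[0], *self.out_shape), z_struct, z_dyn

clip = torch.rand(2, 3, 8, 32, 32)                  # dummy clip batch
recon, z_struct, z_dyn = TwoLatentVideoAE()(clip)
print(recon.shape, z_struct.shape, z_dyn.shape)     # (2, 3, 8, 32, 32) (2, 64) (2, 8)
```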
arXiv Detail & Related papers (2024-12-23T17:16:58Z)
- Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z)
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, 34.07 → 34.57 (measured with the PSNR metric).
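To make "learnable positional features" concrete, here is a rough, assumption-laden sketch: a small learnable feature volume is sampled at continuous (x, y, t) coordinates and decoded to RGB by a tiny MLP. This is not NVP's actual architecture (the method organizes its latent grids differently); all names and sizes are placeholders.

```python
# Toy learnable positional-feature grid for video (illustrative assumption, not NVP).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGridVideo(nn.Module):
    def __init__(self, c=16, gt=8, gh=32, gw=32):
        super().__init__()
        self.grid = nn.Parameter(torch.randn(1, c, gt, gh, gw) * 0.01)  # latent feature volume
        self.head = nn.Sequential(nn.Linear(c, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coords):                      # coords: (P, 3) in [-1, 1], ordered (x, y, t)
        samp = coords.view(1, -1, 1, 1, 3)          # grid_sample wants (N, D, H, W, 3) query points
        feats = F.grid_sample(self.grid, samp, align_corners=True)      # (1, C, P, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).squeeze(0).t()            # (P, C)
        return torch.sigmoid(self.head(feats))                          # (P, 3) RGB

# Query a full 4x16x16 (t, y, x) lattice of pixels.
t, y, x = torch.meshgrid(torch.linspace(-1, 1, 4), torch.linspace(-1, 1, 16),
                         torch.linspace(-1, 1, 16), indexing="ij")
coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)
rgb = LearnableGridVideo()(coords)                  # (1024, 3)
```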
arXiv Detail & Related papers (2022-10-13T08:15:08Z)