Versatile Video Tokenization with Generative 2D Gaussian Splatting
- URL: http://arxiv.org/abs/2508.11183v1
- Date: Fri, 15 Aug 2025 03:16:45 GMT
- Title: Versatile Video Tokenization with Generative 2D Gaussian Splatting
- Authors: Zhenghao Chen, Zicong Chen, Lei Liu, Yiming Wu, Dong Xu
- Abstract summary: The Gaussian Video Transformer (GVT) is a versatile video tokenizer built upon a generative 2D Gaussian Splatting strategy. GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.
- Score: 21.242557918885012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid, patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video optimization. To enhance temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact representation. We primarily evaluate GVT on video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.
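The splatting step described in the abstract can be pictured concretely: each 2D Gaussian carries a latent feature vector and a rendering weight, the feature map is recovered by alpha-normalized blending of all Gaussians at every pixel, and the GSP split lets static Gaussians be encoded once and reused across time-steps. The NumPy sketch below is a minimal illustration under those assumptions only; the function and field names (render_gaussians, render_clip, mu, sigma, opacity, feat) are hypothetical, and the paper's actual STGE generator and rasterizer are learned, GPU-parallel modules that are not reproduced here.

```python
# Minimal sketch of 2D Gaussian splatting over a latent feature grid,
# in the spirit of GVT's generative 2DGS tokenization. All names are
# hypothetical; this is an illustration, not the paper's implementation.
import numpy as np

def render_gaussians(mu, sigma, opacity, feat, H, W):
    """Rasterize N isotropic 2D Gaussians into an H x W x C feature map.

    mu      : (N, 2) centers in (x, y) pixel coordinates
    sigma   : (N,)   isotropic standard deviations
    opacity : (N,)   rendering weights (higher weight -> more influence,
                     i.e. more representational budget spent on that region)
    feat    : (N, C) per-Gaussian latent features
    """
    ys, xs = np.mgrid[0:H, 0:W]                   # pixel coordinate grid
    out = np.zeros((H, W, feat.shape[1]))
    norm = np.full((H, W), 1e-8)                  # avoid divide-by-zero
    for m, s, o, f in zip(mu, sigma, opacity, feat):
        d2 = (xs - m[0]) ** 2 + (ys - m[1]) ** 2  # squared pixel distance
        w = o * np.exp(-0.5 * d2 / (s ** 2))      # Gaussian kernel * opacity
        out += w[..., None] * f                   # accumulate weighted feature
        norm += w
    return out / norm[..., None]                  # alpha-normalized blend

# Gaussian Set Partitioning, schematically: static Gaussians are shared by
# every time-step, while a separate dynamic set is supplied per frame.
def render_clip(static_g, dynamic_per_frame, H, W):
    frames = []
    for dyn in dynamic_per_frame:                 # one dynamic set per step
        g = {k: np.concatenate([static_g[k], dyn[k]]) for k in static_g}
        frames.append(render_gaussians(g["mu"], g["sigma"],
                                       g["opacity"], g["feat"], H, W))
    return np.stack(frames)                       # (T, H, W, C) latent clip

# Toy usage: 3 static + 2 dynamic Gaussians per frame, 2 frames, C = 4.
rng = np.random.default_rng(0)
def rand_set(n, H, W, C):
    return {"mu": rng.uniform(0, [W, H], (n, 2)),
            "sigma": rng.uniform(2.0, 6.0, n),
            "opacity": rng.uniform(0.5, 1.0, n),
            "feat": rng.normal(size=(n, C))}
static = rand_set(3, 32, 32, 4)
clip = render_clip(static, [rand_set(2, 32, 32, 4) for _ in range(2)], 32, 32)
print(clip.shape)  # (2, 32, 32, 4)
```

In this toy form, raising a Gaussian's opacity concentrates blending weight on its region, which mirrors the spatial-adaptability effect the abstract describes, while the static set is stored once and reused for every frame, which is the source of the temporal compaction GSP aims at.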
Related papers
- Contour Information Aware 2D Gaussian Splatting for Image Representation [0.0]
We propose a Contour Information-Aware 2D Gaussian Splatting framework. Our method achieves higher reconstruction quality around object edges compared to existing 2DGS methods.
arXiv Detail & Related papers (2025-12-29T07:24:36Z) - D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos [12.24209693552492]
Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40x compression in under 2 seconds.
arXiv Detail & Related papers (2025-07-08T10:39:32Z) - 4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video [56.04182926886754]
3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion trade-off during training. We propose 4DGC, a rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV.
arXiv Detail & Related papers (2025-03-24T08:05:27Z) - D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS [22.373386953378002]
Implicit Neural Representations (INRs) have emerged as a powerful approach for video representation, offering versatility across tasks such as compression and inpainting. We propose a novel video representation based on deformable 2D Gaussian splatting, dubbed D2GV. We demonstrate D2GV's versatility in tasks including video interpolation, inpainting, and denoising, underscoring its potential as a promising solution for video representation.
arXiv Detail & Related papers (2025-03-07T17:26:27Z) - GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting [10.568851068989973]
Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression. We propose a new video representation and compression method based on 2D Gaussian Splatting to efficiently handle video data. Our method reduces memory usage by up to 78.4% and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding.
arXiv Detail & Related papers (2025-03-06T11:31:08Z) - Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling [64.84686527988809]
Deformable Gaussian Splatting has emerged as a robust solution to represent real-world dynamic scenes. Our approach formulates dynamic scenes using a sparse anchor-grid representation, with the motion flow of dense Gaussians calculated via a classical kernel representation. Experiments on two real-world datasets demonstrate that our EDGS significantly improves the rendering speed with superior rendering quality.
arXiv Detail & Related papers (2025-02-27T18:53:06Z) - VidTwin: Video VAE with Decoupled Structure and Dynamics [24.51768013474122]
VidTwin is a compact video autoencoder that decouples video into two distinct latent spaces. Structure latent vectors capture overall content and global movement, and Dynamics latent vectors represent fine-grained details and rapid movements. Experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality.
arXiv Detail & Related papers (2024-12-23T17:16:58Z) - Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos. We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos. This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z) - MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo [54.00987996368157]
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS).
MVSGaussian achieves real-time rendering with better synthesis quality for each scene.
arXiv Detail & Related papers (2024-05-20T17:59:30Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2x faster (less than 5 minutes) but also exceeds its encoding quality, improving from 34.07 to 34.57 (measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z)