GVD: Guiding Video Diffusion Model for Scalable Video Distillation
- URL: http://arxiv.org/abs/2507.22360v1
- Date: Wed, 30 Jul 2025 03:51:35 GMT
- Title: GVD: Guiding Video Diffusion Model for Scalable Video Distillation
- Authors: Kunyang Li, Jeffrey A Chan Santiago, Sarinda Dhanesh Samarasinghe, Gaowen Liu, Mubarak Shah
- Abstract summary: Video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. Our method's diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets.
- Score: 45.67255330446926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To address the large computation and storage requirements associated with large video datasets, video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset, such that training on the distilled data yields performance comparable to training on all of the data. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. GVD jointly distills spatial and temporal features, ensuring high-fidelity video generation across diverse actions while capturing essential motion information. Our method's diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets across 5, 10, and 20 Instances Per Class (IPC). Specifically, our method achieves 78.29 percent of the original dataset's performance using only 1.98 percent of the total number of frames in MiniUCF, and 73.83 percent of the performance with just 3.30 percent of the frames in HMDB51. Experimental results across benchmark video datasets demonstrate that GVD not only achieves state-of-the-art performance but can also generate higher-resolution videos and scale to higher IPC without significantly increasing computational cost.
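A minimal sketch of the distillation loop implied by the abstract, assuming a hypothetical class-conditional sampler `sample_video(class_prompt, num_frames, seed)`; GVD's actual guidance signal (the joint spatial and temporal feature distillation) is not specified in the abstract and is not reproduced here:

```python
from typing import Any, Callable, Dict, List

def distill_with_diffusion(class_names: List[str],
                           ipc: int,
                           frames_per_clip: int,
                           sample_video: Callable[[str, int, int], Any]) -> Dict[str, List[Any]]:
    """Generate `ipc` synthetic clips per class with a guided video diffusion
    sampler (hypothetical interface), giving ipc * len(class_names) *
    frames_per_clip frames in total -- the frame budget the abstract reports
    as a small fraction of the original dataset."""
    distilled: Dict[str, List[Any]] = {}
    for name in class_names:
        # Different seeds per instance are meant to yield the diverse yet
        # representative clips the abstract describes.
        distilled[name] = [sample_video(name, frames_per_clip, seed)
                           for seed in range(ipc)]
    return distilled
```

Training a downstream classifier on such a distilled set, rather than on the full dataset, is what the reported relative-performance numbers (e.g. 78.29 percent on MiniUCF) measure.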
Related papers
- Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics [68.85010825225528]
Video datasets present unique challenges due to the presence of temporal information and varying levels of redundancy across different classes. Existing DD approaches assume a uniform level of temporal redundancy across all video semantics, which limits their effectiveness on video datasets. We propose Dynamic-Aware Video Distillation (DAViD), a Reinforcement Learning (RL) approach to predict the optimal temporal resolution of the synthetic videos.
arXiv Detail & Related papers (2025-05-28T11:43:58Z) - Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets [13.22969334943219]
We propose a novel uni-level video dataset distillation framework. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering mechanism. Our method achieves state-of-the-art performance, bridging the gap between real and distilled video data.
arXiv Detail & Related papers (2025-05-27T04:02:57Z) - Video Dataset Condensation with Diffusion Models [7.44997213284633]
Video dataset distillation is a promising solution to generate a compact synthetic dataset that retains the essential information from a large real dataset. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to 10.61% over the state-of-the-art.
arXiv Detail & Related papers (2025-05-10T15:12:19Z) - Latent Video Dataset Distillation [6.028880672839687]
We introduce a novel video dataset distillation approach that operates in the latent space. We employ a diversity-aware data selection strategy to select both representative and diverse samples. We also introduce a simple, training-free method to further compress the latent dataset.
arXiv Detail & Related papers (2025-04-23T22:50:39Z) - AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the number of inference steps and thereby accelerate video diffusion models with a synthetic dataset. Our model achieves an 8.5x improvement in generation speed compared to the teacher model. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z) - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z) - Differentially Private Video Activity Recognition [79.36113764129092]
We propose Multi-Clip DP-SGD, a novel framework for enforcing video-level differential privacy through clip-based classification models.
Our approach achieves 81% accuracy with a privacy budget of epsilon=5 on UCF-101, marking a 76% improvement compared to a direct application of DP-SGD. A minimal sketch of the clip-level gradient averaging idea appears after this list.
arXiv Detail & Related papers (2023-06-27T18:47:09Z) - VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [49.68815656405452]
We show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP).
We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction; a sketch of tube masking appears after this list.
Our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.
arXiv Detail & Related papers (2022-03-23T17:55:10Z)
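For the Multi-Clip DP-SGD entry above, the NumPy sketch below illustrates the clip-averaging idea under stated assumptions: gradients from several clips of the same video are averaged into one per-video gradient, which is then clipped and noised so that the privacy unit is the whole video rather than a single clip. The flattened-gradient interface is illustrative, not the paper's implementation.

```python
import numpy as np

def multi_clip_dp_sgd_step(params, per_clip_grads, clip_norm=1.0,
                           noise_multiplier=1.0, lr=0.05, rng=None):
    """One video-level DP-SGD step (illustrative).

    per_clip_grads: list with one entry per video; each entry is an array of
    shape (num_clips, dim) holding the flattened gradient computed on each
    sampled clip of that video.
    """
    rng = np.random.default_rng() if rng is None else rng
    per_video = []
    for clip_grads in per_clip_grads:
        g = clip_grads.mean(axis=0)                        # average clips -> one gradient per video
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        per_video.append(g * scale)                        # clip to bound per-video sensitivity
    summed = np.sum(per_video, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return params - lr * (summed + noise) / len(per_video)
```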
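For the VideoMAE entry, here is a small sketch of tube masking as the abstract describes it: one random spatial mask is drawn and reused for every frame, so the masked patches form tubes along the time axis. The patch-grid arguments are illustrative defaults, not the paper's exact configuration.

```python
import numpy as np

def tube_mask(num_frames, h_patches=14, w_patches=14, mask_ratio=0.9, rng=None):
    """Return a boolean mask of shape (num_frames, h_patches * w_patches);
    True marks a masked patch, and the spatial pattern is identical in every
    frame (a "tube")."""
    rng = np.random.default_rng() if rng is None else rng
    num_patches = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_patches))
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, size=num_masked, replace=False)] = True
    # Repeat the same spatial mask across time to form tubes.
    return np.broadcast_to(spatial, (num_frames, num_patches)).copy()
```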