PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- URL: http://arxiv.org/abs/2505.22564v1
- Date: Wed, 28 May 2025 16:42:10 GMT
- Title: PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- Authors: Jaehyun Choi, Jiwan Hur, Gyojin Han, Jaemyung Yu, Junmo Kim
- Abstract summary: This paper introduces PRISM (Progressive Refinement and Insertion for Sparse Motion) for video dataset condensation. Unlike previous methods that separate static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames to fully accommodate the motion in an action while achieving better performance with less storage, considering the relation of gradients for each frame.
- Score: 22.804486552524885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video dataset condensation has emerged as a critical technique for addressing the computational challenges associated with large-scale video data processing in deep learning applications. While significant progress has been made in image dataset condensation, the video domain presents unique challenges due to the complex interplay between spatial content and temporal dynamics. This paper introduces PRISM (Progressive Refinement and Insertion for Sparse Motion), a novel approach to video dataset condensation that fundamentally reconsiders how video data should be condensed. Unlike previous methods that separate static content from dynamic motion, our method preserves the essential interdependence between these elements. Guided by the relations among per-frame gradients, our approach progressively refines and inserts frames to fully accommodate the motion in an action, achieving better performance with less storage. Extensive experiments across standard video action recognition benchmarks demonstrate that PRISM outperforms existing disentangled approaches while maintaining compact representations suitable for resource-constrained environments.
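Note: The abstract describes PRISM only at a high level. The snippet below is a minimal, hedged sketch of the general idea of gradient-relation-driven condensation with progressive frame insertion; the tiny network, the plateau-based insertion criterion, and all names (SmallVideoNet, gradient_match_loss, condense) are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch (assumptions throughout): gradient-matching video condensation
# with progressive frame insertion. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallVideoNet(nn.Module):
    """Tiny 3D-conv classifier used only to produce gradients to match."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.fc(self.features(x).flatten(1))


def gradient_match_loss(net, real, syn, labels):
    """Distance between gradients induced by real and synthetic clips."""
    crit = nn.CrossEntropyLoss()
    g_real = torch.autograd.grad(crit(net(real), labels), net.parameters())
    g_syn = torch.autograd.grad(crit(net(syn), labels), net.parameters(),
                                create_graph=True)
    return sum(F.mse_loss(s, r.detach()) for s, r in zip(g_syn, g_real))


def condense(real_clips, labels, init_frames=2, max_frames=8, steps=200):
    """Refine a short synthetic clip; insert a frame when matching plateaus
    (an assumed, simplified stand-in for the gradient-relation criterion)."""
    net = SmallVideoNet(num_classes=int(labels.max()) + 1)
    B, C, T, H, W = real_clips.shape
    syn = torch.randn(B, C, init_frames, H, W, requires_grad=True)
    opt = torch.optim.Adam([syn], lr=0.1)
    best = float("inf")
    for step in range(steps):
        # Temporally subsample real clips to the current synthetic length.
        idx = torch.linspace(0, T - 1, syn.shape[2]).long()
        loss = gradient_match_loss(net, real_clips[:, :, idx], syn, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        plateau = loss.item() > 0.99 * best
        best = min(best, loss.item())
        if step % 50 == 49 and plateau and syn.shape[2] < max_frames:
            mid = syn.shape[2] // 2  # duplicate the middle frame, then re-refine
            syn = torch.cat([syn[:, :, :mid + 1], syn[:, :, mid:]], dim=2)
            syn = syn.detach().requires_grad_(True)
            opt = torch.optim.Adam([syn], lr=0.1)
    return syn.detach()


# Toy usage: condense four random 16-frame "videos" into short synthetic clips.
real = torch.randn(4, 3, 16, 32, 32)
y = torch.tensor([0, 1, 2, 3])
print(condense(real, y).shape)  # e.g. torch.Size([4, 3, <=8, 32, 32])
```
In this toy setup a frame is simply duplicated and re-optimized whenever refinement plateaus, which mirrors the "progressive refinement and insertion" idea in spirit only.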
Related papers
- Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets [13.22969334943219]
We propose a novel uni-level video dataset distillation framework. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering mechanism (a toy illustration of this kind of filtering appears after this list). Our method achieves state-of-the-art performance, bridging the gap between real and distilled video data.
arXiv Detail & Related papers (2025-05-27T04:02:57Z) - Condensing Action Segmentation Datasets via Generative Network Inversion [37.78120420622088]
This work presents the first condensation approach for procedural video datasets used in temporal action segmentation (TAS). We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performance.
arXiv Detail & Related papers (2025-03-18T10:29:47Z) - Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding [11.211803499867639]
We propose DYTO, a novel dynamic token merging framework for zero-shot video understanding. DYTO integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences. Experiments demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods.
arXiv Detail & Related papers (2024-11-21T18:30:11Z) - Event-based Video Frame Interpolation with Edge Guided Motion Refinement [28.331148083668857]
We introduce an end-to-end E-VFI learning method to efficiently utilize edge features from event signals for motion flow and warping enhancement.
Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation.
Experiments on both synthetic and real datasets show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-04-28T12:13:34Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
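The temporal saliency-guided filtering mentioned for "Temporal Saliency-Guided Distillation" above is summarized in a single sentence. The sketch below is a toy illustration under the assumption that saliency can be approximated by mean absolute inter-frame difference; saliency_filter is a hypothetical helper, and the actual mechanism in that paper may differ.
```python
# Toy sketch (assumption): approximate temporal saliency by mean absolute
# inter-frame difference and keep the most "dynamic" frames.
import torch


def saliency_filter(video: torch.Tensor, keep: int) -> torch.Tensor:
    """video: (C, T, H, W) -> the `keep` frames with the largest motion proxy."""
    diffs = (video[:, 1:] - video[:, :-1]).abs().mean(dim=(0, 2, 3))  # (T-1,)
    saliency = torch.cat([diffs[:1], diffs])  # pad frame 0 with its successor's diff
    idx = saliency.topk(keep).indices.sort().values  # keep temporal order
    return video[:, idx]


clip = torch.randn(3, 16, 32, 32)
print(saliency_filter(clip, keep=4).shape)  # torch.Size([3, 4, 32, 32])
```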