Exploring Temporally Dynamic Data Augmentation for Video Recognition
- URL: http://arxiv.org/abs/2206.15015v1
- Date: Thu, 30 Jun 2022 04:34:34 GMT
- Title: Exploring Temporally Dynamic Data Augmentation for Video Recognition
- Authors: Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang,
Dongyoon Wee, Sangyoun Lee
- Abstract summary: We propose a simple yet effective video data augmentation framework, DynaAugment.
The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling.
We experimentally demonstrate that there is additional room for improvement over static augmentations across diverse video models.
- Score: 21.233868129923458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data augmentation has recently emerged as an essential component of modern
training recipes for visual recognition tasks. However, data augmentation for
video recognition has rarely been explored despite its effectiveness. The few
existing augmentation recipes for video recognition naively extend image
augmentation methods by applying the same operations to all video frames.
Our main idea is that the magnitude of augmentation operations for each frame
needs to be changed over time to capture the real-world video's temporal
variations. These variations should be generated as diversely as possible while
introducing few additional hyper-parameters during training. Motivated by this, we
propose a simple yet effective video data augmentation framework, DynaAugment.
The magnitude of augmentation operations on each frame is changed by an
effective mechanism, Fourier Sampling, which parameterizes diverse, smooth, and
realistic temporal variations. DynaAugment also includes an extended search
space, suitable for video, for automatic data augmentation methods. Our
experiments demonstrate that there is additional room for performance
improvement beyond static augmentations across diverse video models. Specifically, we
show the effectiveness of DynaAugment on various video datasets and tasks:
large-scale video recognition (Kinetics-400 and Something-Something-v2),
small-scale video recognition (UCF-101 and HMDB-51), fine-grained video
recognition (Diving-48 and FineGym), video action segmentation on Breakfast,
video action localization on THUMOS'14, and video object detection on MOT17Det.
DynaAugment also enables video models to learn more generalized representations,
improving model robustness on corrupted videos.
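To make the core idea concrete, the sketch below shows how a smooth per-frame magnitude curve might be sampled from a few random sinusoids and then used to modulate an augmentation operation over time. This is a minimal sketch assuming the mechanism works roughly as the abstract describes; the function names, the chosen frequency range, and the brightness operation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fourier_sampling(num_frames, max_magnitude, num_basis=3, rng=None):
    """Sample a smooth per-frame magnitude curve from a few random sinusoids.

    Illustrative sketch of the idea described in the abstract (diverse,
    smooth temporal variation of augmentation strength); not the authors'
    reference implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, num_frames)
    curve = np.zeros(num_frames)
    for _ in range(num_basis):
        freq = rng.integers(1, 4)             # low frequencies keep the curve smooth
        phase = rng.uniform(0.0, 2.0 * np.pi)
        amp = rng.uniform(0.5, 1.0)
        curve += amp * np.sin(2.0 * np.pi * freq * t + phase)
    # Rescale to [0, max_magnitude] so every frame gets a valid magnitude.
    curve = (curve - curve.min()) / (curve.max() - curve.min() + 1e-8)
    return curve * max_magnitude


def dynamic_brightness(video, max_magnitude=0.4, rng=None):
    """Brightness jitter whose strength varies smoothly over the clip.

    `video` is a float array of shape (T, H, W, C) in [0, 1]. A static
    augmentation would apply one magnitude to all T frames; here each
    frame gets its own magnitude from the sampled curve.
    """
    magnitudes = fourier_sampling(video.shape[0], max_magnitude, rng=rng)
    out = video + magnitudes[:, None, None, None]
    return np.clip(out, 0.0, 1.0)


if __name__ == "__main__":
    clip = np.random.rand(16, 112, 112, 3).astype(np.float32)  # dummy 16-frame clip
    print(dynamic_brightness(clip).shape)  # (16, 112, 112, 3)
```

The same per-frame curve could, in principle, drive any magnitude-parameterized operation (rotation, translation, color jitter), which is how a temporally dynamic variant of a static augmentation search space would be obtained.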
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z) - Just a Glimpse: Rethinking Temporal Information for Video Continual
Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z) - Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization [23.245275661852446]
We propose a self-supervised method for learning motion-focused video representations.
We learn similarities between videos with identical local motion dynamics but an otherwise different appearance.
Our approach maintains performance when using only 25% of the pretraining videos.
arXiv Detail & Related papers (2023-03-20T10:31:35Z) - Bidirectional Cross-Modal Knowledge Exploration for Video Recognition
with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - Learn2Augment: Learning to Composite Videos for Data Augmentation in
Action Recognition [47.470845728457135]
We learn what makes a good video for action recognition and select only high-quality samples for augmentation.
We learn which pairs of videos to augment without having to actually composite them.
We see improvements of up to 8.6% in the semi-supervised setting.
arXiv Detail & Related papers (2022-06-09T23:04:52Z) - Learning Representational Invariances for Data-Efficient Action
Recognition [52.23716087656834]
We show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets.
We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.
arXiv Detail & Related papers (2021-03-30T17:59:49Z) - VideoMix: Rethinking Data Augmentation for Video Classification [29.923635550986997]
State-of-the-art video action classifiers often suffer from overfitting.
Recent data augmentation strategies have been reported to address the overfitting problems.
VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition.
arXiv Detail & Related papers (2020-12-07T05:40:33Z)