VideoMix: Rethinking Data Augmentation for Video Classification
- URL: http://arxiv.org/abs/2012.03457v1
- Date: Mon, 7 Dec 2020 05:40:33 GMT
- Title: VideoMix: Rethinking Data Augmentation for Video Classification
- Authors: Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Jinhyung Kim
- Abstract summary: State-of-the-art video action classifiers often suffer from overfitting.
Recent data augmentation strategies have been reported to address the overfitting problems.
VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition.
- Score: 29.923635550986997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art video action classifiers often suffer from overfitting. They
tend to be biased towards specific objects and scene cues, rather than the
foreground action content, leading to sub-optimal generalization performance.
Recent data augmentation strategies have been reported to address the
overfitting problem in static image classifiers. Despite their effectiveness on
static images, however, data augmentation has rarely been studied for
videos. For the first time in the field, we systematically analyze the efficacy
of various data augmentation strategies on the video classification task. We
then propose a powerful augmentation strategy VideoMix. VideoMix creates a new
training video by inserting a video cuboid into another video. The ground truth
labels are mixed proportionally to the number of voxels from each video. We
show that VideoMix lets a model learn beyond the object and scene biases and
extract more robust cues for action recognition. VideoMix consistently
outperforms other augmentation baselines on Kinetics and the challenging
Something-Something-V2 benchmarks. It also improves the weakly-supervised
action localization performance on THUMOS'14. VideoMix pretrained models
exhibit improved accuracies on the video detection task (AVA).
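The cuboid-mixing operation described in the abstract follows the CutMix recipe extended along the temporal axis: a spatio-temporal region is cut from one clip, pasted into another, and the ground-truth labels are mixed in proportion to the voxel counts. The snippet below is a minimal sketch of that idea, not the authors' released code; the function name `videomix`, the Beta(alpha, alpha) sampling of the mix ratio, and the (batch, channels, frames, height, width) clip layout are assumptions made for illustration.
```python
import torch
import torch.nn.functional as F

def videomix(clips, labels, num_classes, alpha=1.0):
    """Paste a random spatio-temporal cuboid from a shuffled copy of the
    batch into each clip and mix the labels by voxel count (sketch).

    clips:  float tensor of shape (B, C, T, H, W)  # layout assumed
    labels: long tensor of shape (B,)
    """
    B, C, T, H, W = clips.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # assumed Beta prior
    perm = torch.randperm(B)

    # Sample a cuboid whose volume is roughly (1 - lam) of the whole clip.
    ratio = (1.0 - lam) ** (1.0 / 3.0)
    t, h, w = int(T * ratio), int(H * ratio), int(W * ratio)
    t0 = torch.randint(0, T - t + 1, (1,)).item()
    y0 = torch.randint(0, H - h + 1, (1,)).item()
    x0 = torch.randint(0, W - w + 1, (1,)).item()

    # Insert the cuboid taken from the shuffled batch into the original clips.
    mixed = clips.clone()
    mixed[:, :, t0:t0 + t, y0:y0 + h, x0:x0 + w] = \
        clips[perm, :, t0:t0 + t, y0:y0 + h, x0:x0 + w]

    # Labels are mixed proportionally to the number of voxels from each video.
    vol = (t * h * w) / float(T * H * W)
    one_hot = F.one_hot(labels, num_classes).float()
    targets = (1.0 - vol) * one_hot + vol * one_hot[perm]
    return mixed, targets
```
A training step would then use a soft-target cross-entropy, e.g. `loss = -(targets * torch.log_softmax(model(mixed), dim=1)).sum(1).mean()`. Note that the abstract does not specify whether the cuboid spans the full temporal extent (a purely spatial mix) or only part of it; the sketch above cuts along all three dimensions.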
Related papers
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets across a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Overlooked Video Classification in Weakly Supervised Video Anomaly Detection [4.162019309587633]
We explicitly study the power of video classification supervision using a BERT or an LSTM.
With this BERT or LSTM, the CNN features of all snippets of a video can be aggregated into a single feature that can be used for video classification.
This simple yet powerful video classification supervision, combined with the MIL framework, brings extraordinary performance improvements on all three major video anomaly detection datasets.
arXiv Detail & Related papers (2022-10-13T03:00:22Z)
- Exploring Temporally Dynamic Data Augmentation for Video Recognition [21.233868129923458]
We propose a simple yet effective video data augmentation framework, DynaAugment.
The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling.
We experimentally demonstrate that there is additional room for improvement over static augmentations on diverse video models.
arXiv Detail & Related papers (2022-06-30T04:34:34Z)
- Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition [47.470845728457135]
We learn what makes a good video for action recognition and select only high-quality samples for augmentation.
We learn which pairs of videos to augment without having to actually composite them.
We see improvements of up to 8.6% in the semi-supervised setting.
arXiv Detail & Related papers (2022-06-09T23:04:52Z)
- Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning [50.544635516455116]
This paper focuses on designing video augmentation for self-supervised learning.
We first analyze the best strategy to mix videos to create a new augmented video sample.
We propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
arXiv Detail & Related papers (2021-12-07T18:58:33Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly correlated with the scene where the action happens, which tends to push the model toward a degenerate solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn not only the dynamic information of the video but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)