Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
- URL: http://arxiv.org/abs/2206.04790v1
- Date: Thu, 9 Jun 2022 23:04:52 GMT
- Title: Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition
- Authors: Shreyank N Gowda, Marcus Rohrbach, Frank Keller, Laura Sevilla-Lara
- Abstract summary: We learn what makes a good video for action recognition and select only high-quality samples for augmentation.
We learn which pairs of videos to augment without having to actually composite them.
We see improvements of up to 8.6% in the semi-supervised setting.
- Score: 47.470845728457135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of data augmentation for video action recognition.
Standard augmentation strategies in video are hand-designed and sample the
space of possible augmented data points either at random, without knowing which
augmented points will be better, or through heuristics. We propose to learn
what makes a good video for action recognition and select only high-quality
samples for augmentation. In particular, we choose video compositing of a
foreground and a background video as the data augmentation process, which
results in diverse and realistic new samples. We learn which pairs of videos to
augment without having to actually composite them. This reduces the space of
possible augmentations, which has two advantages: it saves computational cost
and increases the accuracy of the final trained classifier, as the augmented
pairs are of higher quality than average. We present experimental results on
the entire spectrum of training settings: few-shot, semi-supervised and fully
supervised. We observe consistent improvements across all of them over prior
work and baselines on Kinetics, UCF101, HMDB51, and achieve a new
state-of-the-art on settings with limited data. We see improvements of up to
8.6% in the semi-supervised setting.
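As a rough illustration of the core idea, the sketch below scores candidate (foreground, background) video pairs from pooled clip features and keeps only the top-k pairs for compositing. It is a minimal sketch, not the authors' implementation: the PairScorer module, the feature dimension, and the value of k are illustrative assumptions, and in practice the scorer would itself be trained to predict which composites help the downstream classifier.

```python
# Minimal sketch of pair selection before compositing (illustrative only).
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a (foreground, background) pair from pooled clip features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, fg_feat: torch.Tensor, bg_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([fg_feat, bg_feat], dim=-1)).squeeze(-1)

# Stand-in pooled backbone features for N foreground and M background videos.
fg_feats = torch.randn(32, 512)
bg_feats = torch.randn(32, 512)
n, m = fg_feats.size(0), bg_feats.size(0)

scorer = PairScorer()  # untrained here; shown only to illustrate the selection step
# Score every candidate (foreground, background) pair without compositing any video.
scores = scorer(
    fg_feats.unsqueeze(1).expand(n, m, -1),
    bg_feats.unsqueeze(0).expand(n, m, -1),
)  # shape (N, M)

# Keep only the top-k highest-scoring pairs; only these would be composited
# and added to the training set.
k = 64
top_scores, flat_idx = scores.flatten().topk(k)
fg_idx = torch.div(flat_idx, m, rounding_mode="floor")
bg_idx = flat_idx % m
selected_pairs = list(zip(fg_idx.tolist(), bg_idx.tolist()))
```

Scoring pairs in feature space is what lets the method avoid compositing every candidate: only the selected pairs incur the cost of actually producing a new video.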
Related papers
- ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning [29.620990627792906]
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning.
arXiv Detail & Related papers (2024-05-24T02:29:03Z)
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval [16.548016892117083]
Text-video retrieval methods have received increased attention over the past few years.
Data augmentation techniques were introduced to improve performance on unseen test examples.
We propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples (a minimal sketch of this feature-space mixing appears after this list).
arXiv Detail & Related papers (2022-08-03T14:05:20Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping work well for videos too, but that previous implementations could not do this at a scale sufficient for it to work well.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z)
- VideoMix: Rethinking Data Augmentation for Video Classification [29.923635550986997]
State-of-the-art video action classifiers often suffer from overfitting.
Recent data augmentation strategies have been reported to address this overfitting problem.
VideoMix lets a model learn beyond object and scene biases and extract more robust cues for action recognition (a sketch of this cube-mixing idea appears after this list).
arXiv Detail & Related papers (2020-12-07T05:40:33Z)
- Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space (a sketch of such a contrastive objective appears after this list).
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
arXiv Detail & Related papers (2020-08-09T19:58:45Z)
- Feature Re-Learning with Data Augmentation for Video Relevance Prediction [35.87597969685573]
Re-learning is realized by projecting a given deep feature into a new space by an affine transformation.
We propose a new data augmentation strategy which works directly on frame-level and video-level features.
arXiv Detail & Related papers (2020-04-08T05:22:41Z)
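Below is a minimal sketch of the feature-space mixing idea referenced in the A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval entry. It assumes pre-extracted video features; the mixing coefficient, similarity threshold, and the helper name mix_similar are illustrative choices rather than the paper's exact procedure.

```python
# Minimal sketch of feature-space mixing of semantically similar samples.
import torch
import torch.nn.functional as F

def mix_similar(features: torch.Tensor, alpha: float = 0.5,
                sim_threshold: float = 0.8) -> torch.Tensor:
    """Mix each sample with its most similar neighbour in feature space,
    keeping only pairs that are sufficiently similar."""
    normed = F.normalize(features, dim=-1)
    sim = normed @ normed.t()
    sim.fill_diagonal_(-1.0)                 # ignore self-similarity
    best_sim, best_idx = sim.max(dim=-1)
    keep = best_sim > sim_threshold
    return alpha * features[keep] + (1 - alpha) * features[best_idx[keep]]

# Stand-in video features; in a retrieval setting the same mixing would be
# applied to the paired caption features so new (video, caption) pairs stay aligned.
video_feats = torch.randn(16, 512)
augmented_video_feats = mix_similar(video_feats, sim_threshold=0.2)
```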
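The VideoMix entry above describes mixing two videos; the sketch below shows one common form of such cube mixing, pasting a random spatial box from one clip into another across all frames and mixing the labels in proportion to the pasted area. Clips are assumed to be shaped (C, T, H, W), and the Beta parameter and box-sampling scheme are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of VideoMix-style cube mixing (illustrative only).
import torch
import torch.nn.functional as F

def videomix(clip_a, clip_b, label_a, label_b, num_classes, beta=1.0):
    """Paste a random spatial box from clip_b into clip_a across all frames
    and mix the one-hot labels in proportion to the pasted area."""
    lam = torch.distributions.Beta(beta, beta).sample().item()
    _, _, H, W = clip_a.shape                       # clips shaped (C, T, H, W)
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(H * cut_ratio), int(W * cut_ratio)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    mixed = clip_a.clone()
    mixed[:, :, y1:y2, x1:x2] = clip_b[:, :, y1:y2, x1:x2]
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)  # actual kept proportion
    one_hot = F.one_hot(torch.tensor([label_a, label_b]), num_classes).float()
    return mixed, lam_adj * one_hot[0] + (1.0 - lam_adj) * one_hot[1]

clip_a, clip_b = torch.rand(3, 16, 112, 112), torch.rand(3, 16, 112, 112)
mixed_clip, mixed_label = videomix(clip_a, clip_b, label_a=3, label_b=7, num_classes=10)
```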
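Finally, a minimal sketch of the contrastive objective described in the Spatiotemporal Contrastive Video Representation Learning entry, assuming two encoded and projected clips per video are already available; the temperature, embedding size, and batch size are illustrative.

```python
# Minimal sketch of an InfoNCE-style contrastive loss over two augmented clips.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: embeddings of two augmented clips of the same videos, shape (B, D).
    Matching rows are positives; all other rows in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))          # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # stand-in projected clip embeddings
loss = contrastive_loss(z1, z2)
```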
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.