Extending Temporal Data Augmentation for Video Action Recognition
- URL: http://arxiv.org/abs/2211.04888v1
- Date: Wed, 9 Nov 2022 13:49:38 GMT
- Title: Extending Temporal Data Augmentation for Video Action Recognition
- Authors: Artjoms Gorpincenko, Michal Mackiewicz
- Abstract summary: We propose novel techniques to strengthen the relationship between the spatial and temporal domains.
The video action recognition results of our techniques outperform their respective variants in Top-1 and Top-5 settings on the UCF-101 and the HMDB-51 datasets.
- Score: 1.3807859854345832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pixel space augmentation has grown in popularity in many Deep Learning areas,
due to its effectiveness, simplicity, and low computational cost. Data
augmentation for videos, however, still remains an under-explored research
topic, as most works have been treating inputs as stacks of static images
rather than temporally linked series of data. Recently, it has been shown that
involving the time dimension when designing augmentations can be superior to
its spatial-only variants for video action recognition. In this paper, we
propose several novel enhancements to these techniques to strengthen the
relationship between the spatial and temporal domains and achieve a deeper
level of perturbations. The video action recognition results of our techniques
outperform their respective variants in Top-1 and Top-5 settings on the UCF-101
and the HMDB-51 datasets.
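The abstract does not spell out the specific operations, so the following is only a minimal sketch of the general idea of temporally linked augmentation, not the authors' method: an erased region whose position drifts across frames, so the perturbation involves the time dimension instead of being repeated identically on every frame. All function and parameter names below are illustrative assumptions.

```python
# Minimal, illustrative sketch of a temporally linked augmentation (NOT the
# authors' exact method): an erased box whose position drifts linearly across
# frames, coupling the perturbation to the time dimension.
import numpy as np

def temporally_linked_erase(clip, box=32, rng=None):
    """clip: float array of shape (T, C, H, W); returns an augmented copy."""
    rng = rng or np.random.default_rng()
    T, _, H, W = clip.shape
    # Random start/end corners of the erased box; interpolate between them.
    y0, y1 = rng.integers(0, H - box, size=2)
    x0, x1 = rng.integers(0, W - box, size=2)
    out = clip.copy()
    for t in range(T):
        a = t / max(T - 1, 1)
        y = int(round((1 - a) * y0 + a * y1))
        x = int(round((1 - a) * x0 + a * x1))
        out[t, :, y:y + box, x:x + box] = 0.0  # zero out the moving region
    return out
```

A spatial-only variant would keep the box fixed for the whole clip; letting it drift is one simple way to couple the spatial and temporal domains.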
Related papers
- An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video [11.293897932762809]
Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.
CNNs suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings.
To overcome this issue, we introduce the 4A pipeline, which employs a series of sophisticated techniques.
arXiv Detail & Related papers (2024-04-10T04:59:51Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on three of the four benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that generalises better to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z) - Exploring Temporally Dynamic Data Augmentation for Video Recognition [21.233868129923458]
We propose a simple yet effective video data augmentation framework, DynaAugment.
The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling.
We experimentally demonstrate that there is additional room for performance improvement over static augmentations on diverse video models (see the sketch after this list).
arXiv Detail & Related papers (2022-06-30T04:34:34Z) - Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping also work well for videos, but that previous implementations could not apply them at a scale sufficient for them to be effective.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z) - TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z) - Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [9.860323576151897]
In image recognition, learning spatially invariant features through data augmentation is a key factor in improving recognition performance.
In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally local features.
Based on our novel temporal data augmentation algorithms, video recognition performances are improved using only a limited amount of training data.
arXiv Detail & Related papers (2020-08-13T06:56:52Z)
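The DynaAugment entry above mentions varying the augmentation magnitude per frame via Fourier Sampling. As a rough illustration of that general idea only (not the paper's implementation; every name and default value below is an assumption), one can build a smooth per-frame magnitude curve from a few random low-frequency sinusoids and use it to scale a simple brightness perturbation:

```python
# Assumption-laden sketch of temporally dynamic augmentation magnitudes in the
# spirit of Fourier Sampling (DynaAugment entry above); not the paper's code.
import numpy as np

def fourier_magnitude_curve(num_frames, max_magnitude=0.3, n_waves=3, rng=None):
    """Smooth per-frame magnitudes in [0, max_magnitude] from random sinusoids."""
    rng = rng or np.random.default_rng()
    t = np.linspace(0.0, 1.0, num_frames)
    curve = np.zeros(num_frames)
    for _ in range(n_waves):
        freq = rng.integers(1, 4)              # low frequencies -> smooth change
        phase = rng.uniform(0.0, 2.0 * np.pi)
        curve += rng.uniform(0.0, 1.0) * np.sin(2.0 * np.pi * freq * t + phase)
    curve = (curve - curve.min()) / (np.ptp(curve) + 1e-8)
    return curve * max_magnitude

def dynamic_brightness(clip, rng=None):
    """clip: float array of shape (T, C, H, W) with values in [0, 1]."""
    rng = rng or np.random.default_rng()
    mags = fourier_magnitude_curve(clip.shape[0], rng=rng)
    sign = rng.choice([-1.0, 1.0])             # brighten or darken the clip
    out = clip + sign * mags[:, None, None, None]
    return np.clip(out, 0.0, 1.0)
```

The point of the sketch is only that each frame receives its own magnitude drawn from a temporally smooth curve, rather than a single static value for the whole clip.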