Extending Temporal Data Augmentation for Video Action Recognition
- URL: http://arxiv.org/abs/2211.04888v1
- Date: Wed, 9 Nov 2022 13:49:38 GMT
- Title: Extending Temporal Data Augmentation for Video Action Recognition
- Authors: Artjoms Gorpincenko, Michal Mackiewicz
- Abstract summary: We propose novel techniques to strengthen the relationship between the spatial and temporal domains.
The video action recognition results of our techniques outperform their respective variants in Top-1 and Top-5 settings on the UCF-101 and the HMDB-51 datasets.
- Score: 1.3807859854345832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pixel space augmentation has grown in popularity in many Deep Learning areas,
due to its effectiveness, simplicity, and low computational cost. Data
augmentation for videos, however, still remains an under-explored research
topic, as most works have been treating inputs as stacks of static images
rather than temporally linked series of data. Recently, it has been shown that
involving the time dimension when designing augmentations can be superior to
its spatial-only variants for video action recognition. In this paper, we
propose several novel enhancements to these techniques to strengthen the
relationship between the spatial and temporal domains and achieve a deeper
level of perturbations. The video action recognition results of our techniques
outperform their respective variants in Top-1 and Top-5 settings on the UCF-101
and the HMDB-51 datasets.
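The abstract does not spell out the specific operations, so the following is only a minimal sketch of the general idea of temporally linked augmentation, not the authors' method: an erased region whose position drifts across frames, so the perturbation involves the time dimension instead of being repeated identically on every frame. All function and parameter names below are illustrative assumptions.

```python
# Minimal, illustrative sketch of a temporally linked augmentation (NOT the
# authors' exact method): an erased box whose position drifts linearly across
# frames, coupling the perturbation to the time dimension.
import numpy as np

def temporally_linked_erase(clip, box=32, rng=None):
    """clip: float array of shape (T, C, H, W); returns an augmented copy."""
    rng = rng or np.random.default_rng()
    T, _, H, W = clip.shape
    # Random start/end corners of the erased box; interpolate between them.
    y0, y1 = rng.integers(0, H - box, size=2)
    x0, x1 = rng.integers(0, W - box, size=2)
    out = clip.copy()
    for t in range(T):
        a = t / max(T - 1, 1)
        y = int(round((1 - a) * y0 + a * y1))
        x = int(round((1 - a) * x0 + a * x1))
        out[t, :, y:y + box, x:x + box] = 0.0  # zero out the moving region
    return out
```

A spatial-only variant would keep the box fixed for the whole clip; letting it drift is one simple way to couple the spatial and temporal domains.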
Related papers
- An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video [11.293897932762809]
Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.
CNNs suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings.
To overcome this issue, we introduce the 4A pipeline, which employs a series of sophisticated techniques.
arXiv Detail & Related papers (2024-04-10T04:59:51Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on three of the four benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that generalises better to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z) - Exploring Temporally Dynamic Data Augmentation for Video Recognition [21.233868129923458]
We propose a simple yet effective video data augmentation framework, DynaAugment.
The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling.
We experimentally demonstrate that there is additional room for performance improvement over static augmentations on diverse video models (see the sketch after this list).
arXiv Detail & Related papers (2022-06-30T04:34:34Z) - Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping also work well for videos, but that previous implementations could not apply them at a scale sufficient for them to be effective.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z) - TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z) - Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [9.860323576151897]
In image recognition, learning spatially invariant features through data augmentation is a key factor in improving recognition performance.
In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally local features.
Based on our novel temporal data augmentation algorithms, video recognition performances are improved using only a limited amount of training data.
arXiv Detail & Related papers (2020-08-13T06:56:52Z)
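The DynaAugment entry above mentions varying the augmentation magnitude per frame via Fourier Sampling. As a rough illustration of that general idea only (not the paper's implementation; every name and default value below is an assumption), one can build a smooth per-frame magnitude curve from a few random low-frequency sinusoids and use it to scale a simple brightness perturbation:

```python
# Assumption-laden sketch of temporally dynamic augmentation magnitudes in the
# spirit of Fourier Sampling (DynaAugment entry above); not the paper's code.
import numpy as np

def fourier_magnitude_curve(num_frames, max_magnitude=0.3, n_waves=3, rng=None):
    """Smooth per-frame magnitudes in [0, max_magnitude] from random sinusoids."""
    rng = rng or np.random.default_rng()
    t = np.linspace(0.0, 1.0, num_frames)
    curve = np.zeros(num_frames)
    for _ in range(n_waves):
        freq = rng.integers(1, 4)              # low frequencies -> smooth change
        phase = rng.uniform(0.0, 2.0 * np.pi)
        curve += rng.uniform(0.0, 1.0) * np.sin(2.0 * np.pi * freq * t + phase)
    curve = (curve - curve.min()) / (np.ptp(curve) + 1e-8)
    return curve * max_magnitude

def dynamic_brightness(clip, rng=None):
    """clip: float array of shape (T, C, H, W) with values in [0, 1]."""
    rng = rng or np.random.default_rng()
    mags = fourier_magnitude_curve(clip.shape[0], rng=rng)
    sign = rng.choice([-1.0, 1.0])             # brighten or darken the clip
    out = clip + sign * mags[:, None, None, None]
    return np.clip(out, 0.0, 1.0)
```

The point of the sketch is only that each frame receives its own magnitude drawn from a temporally smooth curve, rather than a single static value for the whole clip.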