Learning Temporally Invariant and Localizable Features via Data
Augmentation for Video Recognition
- URL: http://arxiv.org/abs/2008.05721v1
- Date: Thu, 13 Aug 2020 06:56:52 GMT
- Title: Learning Temporally Invariant and Localizable Features via Data
Augmentation for Video Recognition
- Authors: Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho,
Sangyoun Lee
- Abstract summary: In image recognition, learning spatially invariant features is a key factor in improving recognition performance and robustness.
In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally localizable features.
Based on our novel temporal data augmentation algorithms, video recognition performance is improved using only a limited amount of training data.
- Score: 9.860323576151897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-Learning-based video recognition has shown promising improvements along
with the development of large-scale datasets and spatiotemporal network
architectures. In image recognition, learning spatially invariant features is a
key factor in improving recognition performance and robustness. Data
augmentation based on visual inductive priors, such as cropping, flipping,
rotating, or photometric jittering, is a representative approach to achieve
these features. Recent state-of-the-art recognition solutions have relied on
modern data augmentation strategies that exploit a mixture of augmentation
operations. In this study, we extend these strategies to the temporal dimension
for videos to learn temporally invariant or temporally localizable features to
cover temporal perturbations or complex actions in videos. Based on our novel
temporal data augmentation algorithms, video recognition performance is
improved using only a limited amount of training data compared with
spatial-only data augmentation algorithms, including on the 1st Visual
Inductive Priors (VIPriors) data-efficient action recognition challenge.
Furthermore, the learned features are temporally localizable, which cannot be
achieved using spatial augmentation algorithms. Our source code is available at
https://github.com/taeoh-kim/temporal_data_augmentation.
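To make the idea concrete, below is a minimal, hypothetical sketch of two temporal counterparts of common spatial augmentations: a random temporal crop (encouraging temporal invariance) and a CutMix-style temporal mix that pastes a contiguous block of frames from one clip into another and mixes the labels in proportion to the frames each clip contributes (encouraging temporal localizability). The function names and the NumPy clip representation are illustrative assumptions, not the authors' implementation; see the repository above for the actual algorithms.

```python
# Illustrative sketch only -- not the paper's implementation.
# A video clip is assumed to be a NumPy array of shape (T, H, W, C).
import numpy as np

def temporal_random_crop(clip: np.ndarray, length: int) -> np.ndarray:
    """Sample a contiguous sub-clip of `length` frames (temporal analogue of spatial cropping)."""
    T = clip.shape[0]
    if length >= T:
        return clip
    start = np.random.randint(0, T - length + 1)
    return clip[start:start + length]

def temporal_cutmix(clip_a: np.ndarray, clip_b: np.ndarray,
                    label_a: np.ndarray, label_b: np.ndarray):
    """Paste a contiguous block of frames from clip_b into clip_a and mix the labels
    in proportion to the number of frames taken from each clip (CutMix-style, along time)."""
    T = min(clip_a.shape[0], clip_b.shape[0])
    cut_len = np.random.randint(1, T)           # length of the pasted segment
    start = np.random.randint(0, T - cut_len + 1)
    mixed = clip_a[:T].copy()
    mixed[start:start + cut_len] = clip_b[start:start + cut_len]
    lam = 1.0 - cut_len / T                     # fraction of frames still from clip_a
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

# Example usage with random clips and one-hot labels for a 10-class problem.
if __name__ == "__main__":
    clip_a = np.random.rand(32, 112, 112, 3)
    clip_b = np.random.rand(32, 112, 112, 3)
    y_a, y_b = np.eye(10)[3], np.eye(10)[7]
    short = temporal_random_crop(clip_a, 16)
    mixed, y_mix = temporal_cutmix(clip_a, clip_b, y_a, y_b)
    print(short.shape, mixed.shape, y_mix)
```

In practice such operations would be applied on the fly in the training data loader alongside the usual spatial augmentations.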
Related papers
- Descriptor: Face Detection Dataset for Programmable Threshold-Based Sparse-Vision [0.8271394038014485]
This dataset is an annotated, temporal-threshold-based vision dataset for face detection tasks derived from the same videos used for Aff-Wild2.
We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that can process based on temporal-difference thresholds.
arXiv Detail & Related papers (2024-10-01T03:42:03Z)
- Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion [35.88039888482076]
We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos.
DAS is extremely fast and flexible, allowing search over very large search spaces in less than a GPU day.
We leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependent transformations.
arXiv Detail & Related papers (2024-03-22T13:27:57Z)
- Augmenting Deep Learning Adaptation for Wearable Sensor Data through Combined Temporal-Frequency Image Encoding [4.458210211781739]
We present a novel modified recurrence-plot-based image representation that seamlessly integrates both temporal and frequency domain information.
We evaluate the proposed method using accelerometer-based activity recognition data and a pretrained ResNet model, and demonstrate its superior performance compared to existing approaches.
arXiv Detail & Related papers (2023-07-03T09:29:27Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Extending Temporal Data Augmentation for Video Action Recognition [1.3807859854345832]
We propose novel techniques to strengthen the relationship between the spatial and temporal domains.
The video action recognition results of our techniques outperform their respective variants in Top-1 and Top-5 settings on the UCF-101 and HMDB-51 datasets.
arXiv Detail & Related papers (2022-11-09T13:49:38Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Controllable Data Augmentation Through Deep Relighting [75.96144853354362]
We explore how to augment a varied set of image datasets through relighting so as to improve the ability of existing models to be invariant to illumination changes.
We develop a tool, based on an encoder-decoder network, that is able to quickly generate multiple variations of the illumination of various input scenes.
We demonstrate that by training models on datasets augmented with our pipeline, it is possible to achieve higher performance on localization benchmarks.
arXiv Detail & Related papers (2021-10-26T20:02:51Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Learning Representational Invariances for Data-Efficient Action Recognition [52.23716087656834]
We show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets.
We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.
arXiv Detail & Related papers (2021-03-30T17:59:49Z)
- AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition [68.70214388982545]
Temporal modelling is the key to efficient video action recognition.
We introduce an adaptive temporal fusion network, called AdaFuse, that fuses channels from current and past feature maps (a simplified sketch of this fusion appears after this list).
Our approach achieves about 40% computation savings with accuracy comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-10T23:31:02Z)
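As a rough illustration of the adaptive temporal fusion idea summarized in the AdaFuse entry above, the sketch below fuses two consecutive feature maps channel by channel with a per-channel gate. In AdaFuse the gating decision comes from a learned, input-dependent policy network (which can also skip channels to save computation); here a random gate stands in for that policy purely for illustration, and all names and shapes are assumptions rather than the paper's implementation.

```python
# Illustrative sketch of channel-wise temporal fusion -- not the AdaFuse implementation.
import numpy as np

def fuse_feature_maps(curr: np.ndarray, prev: np.ndarray, keep_prob: float = 0.5) -> np.ndarray:
    """Fuse feature maps of shape (C, H, W) from the current and previous frames.

    For each channel, a gate selects either the current-frame channel or reuses the
    previous-frame channel. In the actual method this gate is predicted per input by
    a learned policy; here it is sampled at random for demonstration purposes.
    """
    C = curr.shape[0]
    gate = np.random.rand(C) < keep_prob        # True -> keep the current-frame channel
    gate = gate[:, None, None]                  # broadcast the gate over spatial dims
    return np.where(gate, curr, prev)

# Example: fuse two random 64-channel feature maps.
if __name__ == "__main__":
    f_prev = np.random.rand(64, 14, 14)
    f_curr = np.random.rand(64, 14, 14)
    fused = fuse_feature_maps(f_curr, f_prev)
    print(fused.shape)
```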
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.