Spatiotemporal Augmentation on Selective Frequencies for Video
Representation Learning
- URL: http://arxiv.org/abs/2204.03865v1
- Date: Fri, 8 Apr 2022 06:19:32 GMT
- Title: Spatiotemporal Augmentation on Selective Frequencies for Video
Representation Learning
- Authors: Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee and
Junmo Kim
- Abstract summary: We propose FreqAug, a data augmentation method that filters videos in the frequency domain for video representation learning.
FreqAug pushes the model to focus more on dynamic features in the video by dropping spatial or temporal low-frequency components.
To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations.
- Score: 36.352159541825095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent self-supervised video representation learning methods focus on
maximizing the similarity between multiple augmented views from the same video
and largely rely on the quality of generated views. In this paper, we propose
frequency augmentation (FreqAug), a spatio-temporal data augmentation method in
the frequency domain for video representation learning. FreqAug stochastically
removes undesirable information from the video by filtering out specific
frequency components so that the learned representation captures essential features
of the video for various downstream tasks. Specifically, FreqAug pushes the
model to focus more on dynamic features rather than static features in the
video via dropping spatial or temporal low-frequency components. In other
words, learning invariance between remaining frequency components results in
high-frequency enhanced representation with less static bias. To verify the
generality of the proposed method, we experiment with FreqAug on multiple
self-supervised learning frameworks along with standard augmentations.
Transferring the improved representation to five video action recognition and
two temporal action localization downstream tasks shows consistent improvements
over baselines.
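The low-frequency dropping described in the abstract maps naturally onto a few lines of FFT code. Below is a minimal sketch of the idea, not the authors' released implementation: the function name freq_drop, the (C, T, H, W) tensor layout, and the p and cutoff parameters are all illustrative assumptions.
```python
# Minimal sketch of FreqAug-style low-frequency dropping (assumptions:
# video is a float tensor of shape (C, T, H, W); `p` and `cutoff` are
# illustrative hyperparameters, not values from the paper).
import torch

def freq_drop(video: torch.Tensor, p: float = 0.5, cutoff: float = 0.1,
              axis: str = "spatial") -> torch.Tensor:
    """Stochastically remove low-frequency components along one axis."""
    if torch.rand(()) > p:  # apply the filter only with probability p
        return video
    if axis == "temporal":
        spec = torch.fft.fft(video, dim=1)         # FFT along the time axis
        freqs = torch.fft.fftfreq(video.shape[1])  # normalized frequencies
        mask = (freqs.abs() >= cutoff).to(video.dtype).view(1, -1, 1, 1)
        return torch.fft.ifft(spec * mask, dim=1).real
    # spatial: 2-D FFT over (H, W), zero out the low-frequency square
    spec = torch.fft.fft2(video, dim=(-2, -1))
    fy = torch.fft.fftfreq(video.shape[-2]).abs().view(-1, 1)
    fx = torch.fft.fftfreq(video.shape[-1]).abs().view(1, -1)
    mask = ((fy >= cutoff) | (fx >= cutoff)).to(video.dtype)
    return torch.fft.ifft2(spec * mask, dim=(-2, -1)).real
```
In a self-supervised pipeline this would be applied on top of the standard augmentations, so that the views a model must match retain mostly high-frequency, motion-related content.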
Related papers
- Multi-View Frequency-Attention Alternative to CNN Frontends for
Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is more beneficial than local convolution.
We obtain a 2.4% relative word error rate reduction on a production-scale transducer by replacing its convolutional neural network frontend.
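As a rough illustration of what attention over frequencies looks like relative to a local convolutional frontend, the sketch below treats each mel-frequency bin of a spectrogram frame as a token; the shapes, the single attention layer, and the mean pooling are assumptions, not the paper's architecture.
```python
# Hypothetical frontend: every frequency bin attends to every other bin,
# giving a global receptive field over frequencies within each frame.
import torch
import torch.nn as nn

B, T, Fbins, d = 4, 100, 80, 64            # batch, frames, mel bins, embed dim
spec = torch.randn(B, T, Fbins)            # stand-in log-mel spectrogram
embed = nn.Linear(1, d)                    # lift each scalar bin to d dims
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

x = embed(spec.reshape(B * T, Fbins, 1))   # tokens = frequency bins per frame
y, _ = attn(x, x, x)                       # global attention across bins
frame_feats = y.mean(dim=1).reshape(B, T, d)  # pooled frontend features
```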
arXiv Detail & Related papers (2023-06-12T08:37:36Z)
- Learning Spatiotemporal Frequency-Transformer for Low-Quality Video Super-Resolution [47.5883522564362]
Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos.
Existing VSR techniques usually recover HR frames by extracting textures from nearby frames with known degradation processes.
We propose a novel Frequency-Transformer (FTVSR) for handling low-quality videos that carries out self-attention in a combined space-time-frequency domain.
arXiv Detail & Related papers (2022-12-27T16:26:15Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs on randomly shuffled frames.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
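A hedged sketch of the low-confidence objective just described: cross-entropy on a correctly ordered clip plus a term pushing predictions on the shuffled clip toward the uniform distribution. The loss name, the KL formulation, and the weight w are assumptions for illustration, not the paper's exact loss.
```python
# Illustrative loss: confident on ordered clips, maximally uncertain on
# shuffled clips (KL to uniform is zero only at the uniform distribution).
import torch
import torch.nn.functional as F

def temporal_debias_loss(logits_ordered, logits_shuffled, labels, w=1.0):
    ce = F.cross_entropy(logits_ordered, labels)
    log_probs = F.log_softmax(logits_shuffled, dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.shape[-1])
    uncertainty = F.kl_div(log_probs, uniform, reduction="batchmean")
    return ce + w * uncertainty
```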
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on the frequency domain find that GAN-forged images have obvious grid-like visual artifacts in the frequency spectrum compared to real images.
This paper proposes a Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
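To make the frequency clue concrete, the helper below computes a log-magnitude 2-D DCT spectrum of a grayscale frame, the kind of representation in which such grid-like artifacts show up; the function name and normalization are illustrative, not FCAN-DCT's actual feature extractor.
```python
# Hypothetical clue extraction: GAN grid artifacts appear as periodic peaks
# in the DCT spectrum, which log-scaling makes easier to inspect or learn from.
import numpy as np
from scipy.fft import dctn

def dct_clue(frame: np.ndarray) -> np.ndarray:
    """Log-magnitude 2-D DCT spectrum of a grayscale frame (H, W)."""
    spectrum = dctn(frame.astype(np.float64), norm="ortho")
    return np.log1p(np.abs(spectrum))
```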
arXiv Detail & Related papers (2022-07-05T09:27:53Z)
- Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z)
- Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to adaptively learn frequency information in a two-branch detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
- Time-Equivariant Contrastive Video Representation Learning [47.50766781135863]
We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos.
Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks.
arXiv Detail & Related papers (2021-12-07T10:45:43Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations and, in doing so, achieves state-of-the-art performance on a number of video benchmarks.
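One way to realize an "augmentation aware" head is to feed the augmentation parameterisation into the projection alongside the features, so the representation need not be invariant to those augmentations; the dimensions and the two-layer MLP below are assumptions, not the paper's model.
```python
# Hypothetical augmentation-aware projection head: the augmentation
# parameters (e.g., crop box, temporal shift) are concatenated to the
# backbone features before projecting into the contrastive space.
import torch
import torch.nn as nn

class AugAwareHead(nn.Module):
    def __init__(self, feat_dim: int = 512, aug_dim: int = 8, proj_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + aug_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, features: torch.Tensor, aug_params: torch.Tensor):
        # features: (B, feat_dim); aug_params: (B, aug_dim) encoding
        return self.mlp(torch.cat([features, aug_params], dim=-1))
```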
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
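A rough sketch of the two-speed setup: the same unlabeled video sampled at two frame rates forms a positive pair for a standard InfoNCE loss. The strides, the stand-in linear encoders, and the temperature are illustrative assumptions, not the paper's configuration.
```python
# Two clips of the same video at different speeds are pulled together;
# clips from other videos in the batch act as negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau              # (B, B) pairwise similarities
    labels = torch.arange(z1.shape[0])      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

video = torch.randn(8, 3, 32, 8, 8)        # tiny (B, C, T, H, W) stand-in
fast, slow = video, video[:, :, ::2]       # same video at two frame rates
enc_fast = nn.Linear(3 * 32 * 8 * 8, 128)  # stand-in pathway encoders
enc_slow = nn.Linear(3 * 16 * 8 * 8, 128)
loss = info_nce(enc_fast(fast.flatten(1)), enc_slow(slow.flatten(1)))
```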
arXiv Detail & Related papers (2021-02-04T17:28:35Z)