Controllable Augmentations for Video Representation Learning
- URL: http://arxiv.org/abs/2203.16632v2
- Date: Fri, 1 Apr 2022 06:52:56 GMT
- Title: Controllable Augmentations for Video Representation Learning
- Authors: Rui Qian, Weiyao Lin, John See, Dian Li
- Abstract summary: We propose a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
- Score: 34.79719112810065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on self-supervised video representation learning. Most
existing approaches follow the contrastive learning pipeline to construct
positive and negative pairs by sampling different clips. However, this
formulation tends to be biased toward static background and has difficulty establishing
global temporal structures. The major reason is that the positive pairs, i.e.,
different clips sampled from the same video, have limited temporal receptive
field, and usually share similar background but differ in motions. To address
these problems, we propose a framework to jointly utilize local clips and
global videos to learn from detailed region-level correspondence as well as
general long-term temporal relations. Based on a set of controllable
augmentations, we achieve accurate appearance and motion pattern alignment
through soft spatio-temporal region contrast. Our formulation is able to avoid
the low-level redundancy shortcut by mutual information minimization to improve
generalization. We also introduce a local-global temporal order dependency to
further bridge the gap between clip-level and video-level representations for
robust temporal modeling. Extensive experiments demonstrate that our framework
is superior on three video benchmarks in action recognition and video
retrieval, capturing more accurate temporal dynamics.
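As a loose illustration of the local-global contrastive objective described in the abstract, the sketch below aligns pooled features of a short local clip with features of the full video through an InfoNCE-style loss and adds a toy temporal order head. The module names, feature dimensions, temperature, and order labels are assumptions for illustration, not the authors' released implementation; the soft spatio-temporal region contrast and mutual information minimization terms are omitted.

```python
# Minimal sketch (not the authors' code): InfoNCE-style alignment between
# local clip features and global video features, plus a simple before/after
# order head standing in for the local-global temporal order dependency.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalContrast(nn.Module):
    def __init__(self, feat_dim=512, proj_dim=128, temperature=0.1):
        super().__init__()
        self.temperature = temperature
        self.proj_local = nn.Linear(feat_dim, proj_dim)   # projects clip-level features
        self.proj_global = nn.Linear(feat_dim, proj_dim)  # projects video-level features
        # Binary head: is the local clip before (0) or after (1) a reference
        # point in the global video? (toy stand-in for order dependency)
        self.order_head = nn.Linear(2 * proj_dim, 2)

    def contrastive_loss(self, local_feat, global_feat):
        """InfoNCE between local clips and global videos; the pair sharing a
        batch index (same source video) is the positive."""
        z_l = F.normalize(self.proj_local(local_feat), dim=-1)
        z_g = F.normalize(self.proj_global(global_feat), dim=-1)
        logits = z_l @ z_g.t() / self.temperature          # (B, B) similarities
        targets = torch.arange(z_l.size(0), device=z_l.device)
        return F.cross_entropy(logits, targets)

    def order_loss(self, local_feat, global_feat, order_labels):
        """Predict the temporal position of the local clip relative to the video."""
        z_l = self.proj_local(local_feat)
        z_g = self.proj_global(global_feat)
        logits = self.order_head(torch.cat([z_l, z_g], dim=-1))
        return F.cross_entropy(logits, order_labels)


if __name__ == "__main__":
    B, D = 8, 512
    model = LocalGlobalContrast(feat_dim=D)
    local = torch.randn(B, D)            # pooled features of short local clips
    global_ = torch.randn(B, D)          # pooled features of the full videos
    order = torch.randint(0, 2, (B,))    # toy before/after labels
    loss = model.contrastive_loss(local, global_) + model.order_loss(local, global_, order)
    loss.backward()
    print(float(loss))
```

In this sketch, the diagonal of the similarity matrix treats the local clip and the global video drawn from the same source video as the positive pair, mirroring the local-global pairing described in the abstract.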
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on the video and paragraph, our approach can also generalize on the matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Exploring Temporal Granularity in Self-Supervised Video Representation Learning [99.02421058335533]
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations.
The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
arXiv Detail & Related papers (2021-12-08T18:58:42Z) - Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Video Contrastive Learning with Global Context [37.966950264445394]
We propose a new video-level contrastive learning method based on segments to formulate positive pairs.
Our formulation is able to capture global context in a video, and is thus robust to temporal content change.
arXiv Detail & Related papers (2021-08-05T16:42:38Z)
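For context on the segment-based positive pair formulation mentioned in the last entry, a minimal sketch follows, assuming the video is split into non-overlapping temporal segments and one clip is drawn from each so that positives span the full video; the segment count, clip length, and function name are hypothetical and not taken from the cited paper.

```python
# Illustrative sketch only: sample one clip start index from each of K
# non-overlapping temporal segments of a video, so positive pairs drawn from
# different segments cover the video's global context.
import random


def sample_segment_clips(num_frames: int, num_segments: int = 4, clip_len: int = 16):
    """Return one (start, end) frame range per temporal segment."""
    seg_len = num_frames // num_segments
    clips = []
    for k in range(num_segments):
        lo = k * seg_len
        hi = min(lo + seg_len, num_frames) - clip_len
        start = random.randint(lo, max(lo, hi))  # fall back to the segment start if it is shorter than clip_len
        clips.append((start, start + clip_len))
    return clips


print(sample_segment_clips(num_frames=256))
```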