No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
- URL: http://arxiv.org/abs/2312.13008v1
- Date: Wed, 20 Dec 2023 13:20:31 GMT
- Title: No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
- Authors: Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah
- Abstract summary: We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
- Score: 69.59938105887538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised approaches for video have shown impressive results in video
understanding tasks. However, unlike early works that leverage temporal
self-supervision, current state-of-the-art methods primarily rely on tasks from
the image domain (e.g., contrastive learning) that do not explicitly promote
the learning of temporal features. We identify two factors that limit existing
temporal self-supervision: 1) tasks are too simple, resulting in saturated
training performance, and 2) we uncover shortcuts based on local appearance
statistics that hinder the learning of high-level features. To address these
issues, we propose 1) a more challenging reformulation of temporal
self-supervision as frame-level (rather than clip-level) recognition tasks and
2) an effective augmentation strategy to mitigate shortcuts. Our model extends
a representation of single video frames, pre-trained through contrastive
learning, with a transformer that we train through temporal self-supervision.
We demonstrate experimentally that our more challenging frame-level task
formulations and the removal of shortcuts drastically improve the quality of
features learned through temporal self-supervision. The generalization
capability of our self-supervised video method is evidenced by its
state-of-the-art performance in a wide range of high-level semantic tasks,
including video retrieval, action classification, and video attribute
recognition (such as object and scene identification), as well as low-level
temporal correspondence tasks like video object segmentation and pose tracking.
Additionally, we show that the video representations learned through our method
exhibit increased robustness to input perturbations.
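The abstract does not spell out implementation details, but the overall recipe (a contrastively pre-trained frame encoder extended with a transformer trained on a frame-level temporal task) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the module sizes, the choice of per-frame temporal-position prediction as the pretext task, and the per-frame augmentation hinted at as a shortcut mitigation are not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelTemporalModel(nn.Module):
    """Sketch: a per-frame encoder (e.g. contrastively pre-trained) whose outputs
    feed a transformer; a per-frame head predicts each frame's original temporal
    position, one plausible frame-level pretext task."""

    def __init__(self, frame_encoder, feat_dim=512, num_frames=16):
        super().__init__()
        self.frame_encoder = frame_encoder  # assumed to map (N, C, H, W) -> (N, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.position_head = nn.Linear(feat_dim, num_frames)

    def forward(self, frames):                       # frames: (B, T, C, H, W), T == num_frames
        B, T = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).reshape(B, T, -1)
        # No positional encoding: temporal order must be inferred from frame content.
        feats = self.temporal_transformer(feats)
        return self.position_head(feats)             # (B, T, T) logits over original positions


def frame_position_loss(model, frames):
    """Shuffle frames per sample and train the model to recover each frame's index.
    Per-frame augmentation (e.g. independent color jitter per frame) would be applied
    upstream to suppress low-level appearance shortcuts -- an assumption, not the
    paper's exact augmentation strategy."""
    B, T = frames.shape[:2]
    perm = torch.stack([torch.randperm(T, device=frames.device) for _ in range(B)])
    shuffled = frames[torch.arange(B, device=frames.device)[:, None], perm]
    logits = model(shuffled)                         # (B, T, T)
    return F.cross_entropy(logits.flatten(0, 1), perm.flatten())
```

Leaving out positional encodings is the key design point of this sketch: the transformer can only solve the task from frame content and dynamics, which is what keeps the frame-level pretext non-trivial. Whether the frame encoder stays frozen during this stage is not stated in the abstract.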
Related papers
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and forces randomly shuffled frames to produce low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
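The second ingredient above (low-confidence outputs for shuffled frames) can be read as a uniformity penalty on the predicted class distribution. A minimal sketch, assuming a KL term toward the uniform distribution; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def shuffled_low_confidence_loss(logits_for_shuffled_clips):
    """Push the action predictions for temporally shuffled inputs toward the
    uniform distribution, i.e. force the model to be unconfident on them."""
    log_probs = F.log_softmax(logits_for_shuffled_clips, dim=-1)
    num_classes = logits_for_shuffled_clips.shape[-1]
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    # KL(uniform || p) is minimized exactly when p is uniform.
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```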
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos [31.1632730473261]
W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations.
We propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in the video.
arXiv Detail & Related papers (2022-04-28T14:56:43Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Exploring Temporal Granularity in Self-Supervised Video Representation Learning [99.02421058335533]
This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations.
The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
arXiv Detail & Related papers (2021-12-08T18:58:42Z)
- Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips [39.29955809641396]
Many backgrounds or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works attempted to address this problem by applying temporal attention, but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips.
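The summary does not detail how STAM computes its attention; as a rough illustration, clip-level attention pooling that normalizes scores across all clips of a video (so each weight reflects global context) might look like the sketch below. The module name and dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ClipAttentionPooling(nn.Module):
    """Illustrative clip-level attention: score each clip, softmax the scores over
    the whole video, and pool clip features with those weights so that
    discriminative clips dominate the video-level representation."""

    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, hidden_dim),
                                    nn.ReLU(),
                                    nn.Linear(hidden_dim, 1))

    def forward(self, clip_feats):                 # clip_feats: (B, N_clips, D)
        scores = self.scorer(clip_feats)           # (B, N_clips, 1)
        weights = torch.softmax(scores, dim=1)     # normalized across clips (global context)
        return (weights * clip_feats).sum(dim=1)   # (B, D) video-level feature
```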
arXiv Detail & Related papers (2021-12-02T08:02:35Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) consecutively regulates the intermediate representation to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
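In its simplest form, the consistency between positive samples mentioned above can be written as a cosine-similarity objective between the embeddings of two positive views; this is only an illustration of the idea, not ASCNet's actual appearance-speed consistency loss.

```python
import torch.nn.functional as F

def positive_consistency_loss(feat_a, feat_b):
    """Encourage two positive views (e.g. clips expected to share an appearance or
    speed attribute) to have highly similar, i.e. consistent, embeddings."""
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    return (1.0 - (feat_a * feat_b).sum(dim=-1)).mean()   # 1 - cosine similarity
```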
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
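Constructing a relative-speed label is straightforward to sketch: sample two clips from the same video at possibly different playback speeds (temporal strides) and compare the speeds. The sampling details and the three-way label below are assumptions for illustration, not RSPNet's exact setup.

```python
import random
import torch

def sample_relative_speed_pair(video, clip_len=16, speeds=(1, 2, 4, 8)):
    """Sample two clips from one video at random playback speeds and return a
    relative-speed label: 0 = first clip slower, 1 = same speed, 2 = first faster.
    `video` is assumed to be a (T, C, H, W) frame tensor."""
    s1, s2 = random.choice(speeds), random.choice(speeds)

    def sample_clip(speed):
        max_start = max(video.shape[0] - clip_len * speed, 0)
        start = random.randint(0, max_start)
        return video[start:start + clip_len * speed:speed]   # temporal stride = playback speed

    clip1, clip2 = sample_clip(s1), sample_clip(s2)
    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)
    return clip1, clip2, torch.tensor(label)
```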
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.