TimeBalance: Temporally-Invariant and Temporally-Distinctive Video
Representations for Semi-Supervised Action Recognition
- URL: http://arxiv.org/abs/2303.16268v1
- Date: Tue, 28 Mar 2023 19:28:54 GMT
- Title: TimeBalance: Temporally-Invariant and Temporally-Distinctive Video
Representations for Semi-Supervised Action Recognition
- Authors: Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak
Shah
- Abstract summary: We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
- Score: 68.53072549422775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised learning can be more beneficial in the video domain
than in images because of video's higher annotation cost and dimensionality.
Moreover, any video understanding task requires reasoning over both the spatial
and temporal dimensions. To learn both static and motion-related features for
semi-supervised action recognition, existing methods rely on hard input
inductive biases, such as using two modalities (RGB and optical flow) or two
streams with different playback rates. Instead of utilizing unlabeled videos
through diverse input streams, we rely on self-supervised video
representations; specifically, we utilize temporally-invariant and
temporally-distinctive representations. We observe that these representations
complement each other depending on the nature of the action. Based on this
observation, we propose a student-teacher semi-supervised learning framework,
TimeBalance, where we distill the knowledge from a temporally-invariant and a
temporally-distinctive teacher. Depending on the nature of the unlabeled video,
we dynamically combine the knowledge of these two teachers based on a novel
temporal similarity-based reweighting scheme. Our method achieves
state-of-the-art performance on three action recognition benchmarks: UCF101,
HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance
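As a rough picture of the dual-teacher distillation described in the abstract, here is a minimal PyTorch sketch. The cosine-similarity-based softmax weighting and the KL distillation objective are illustrative assumptions, as is the function name `timebalance_distill_loss`; the authors' actual reweighting scheme is defined in the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def timebalance_distill_loss(student_logits, t_inv_logits, t_dis_logits,
                             inv_feats, dis_feats, tau=0.1):
    """Distill a student from two frozen teachers with a dynamic weight.

    inv_feats / dis_feats: each a pair of (B, D) features for two temporal
    crops of the same unlabeled video, from the temporally-invariant and
    temporally-distinctive teacher respectively. The weighting rule below
    (softmax over crop similarities) is a hypothetical stand-in for the
    paper's temporal similarity-based reweighting scheme.
    """
    sim_inv = F.cosine_similarity(inv_feats[0], inv_feats[1], dim=-1)  # (B,)
    sim_dis = F.cosine_similarity(dis_feats[0], dis_feats[1], dim=-1)  # (B,)
    w = torch.softmax(torch.stack([sim_inv, sim_dis], dim=-1) / tau, dim=-1)
    # Soft pseudo-label: convex combination of the two teachers' predictions.
    target = (w[:, :1] * t_inv_logits.softmax(dim=-1)
              + w[:, 1:] * t_dis_logits.softmax(dim=-1)).detach()
    return F.kl_div(student_logits.log_softmax(dim=-1), target,
                    reduction="batchmean")

# Shape check only; real inputs come from video backbones and classifier heads.
B, C, D = 4, 101, 512
loss = timebalance_distill_loss(
    torch.randn(B, C), torch.randn(B, C), torch.randn(B, C),
    (torch.randn(B, D), torch.randn(B, D)),
    (torch.randn(B, D), torch.randn(B, D)))
```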
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
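One plausible reading of the "low-confidence outputs" constraint in the entry above is a uniform-target penalty on temporally shuffled clips. The sketch below follows that reading; the helper names and the KL-to-uniform formulation are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def shuffle_time(clip):
    """Randomly permute the frame axis of a (B, T, C, H, W) clip."""
    perm = torch.randperm(clip.shape[1], device=clip.device)
    return clip[:, perm]

def low_confidence_loss(class_logits):
    """Push class predictions on shuffled clips toward the uniform
    distribution, i.e. maximize predictive entropy -- one way to enforce
    'low-confidence outputs' on temporally broken inputs."""
    log_p = F.log_softmax(class_logits, dim=-1)
    uniform = torch.full_like(log_p, 1.0 / log_p.shape[-1])
    return F.kl_div(log_p, uniform, reduction="batchmean")
```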
- Time-Equivariant Contrastive Video Representation Learning [47.50766781135863]
We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos.
Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks.
arXiv Detail & Related papers (2021-12-07T10:45:43Z)
- Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks.
arXiv Detail & Related papers (2021-11-25T20:30:30Z)
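For concreteness, the temporal-gradient modality mentioned in the entry above amounts to frame differencing, an RGB-derived motion cue that is far cheaper than optical flow. A minimal sketch follows; how the gradient stream is normalized and fused with RGB is the paper's design and is not shown here.

```python
import torch

def temporal_gradient(clip):
    """Frame-to-frame difference of a (B, T, C, H, W) clip.
    Output has T-1 time steps."""
    return clip[:, 1:] - clip[:, :-1]

rgb = torch.randn(2, 16, 3, 112, 112)
tg = temporal_gradient(rgb)  # (2, 15, 3, 112, 112), fed as an extra modality
```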
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
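At its core, the two-pathway model in the entry above can be read as an instance-level contrastive objective between a slow and a fast clip of the same video. The sketch below is that generic reading in PyTorch; the encoder pathways and any additional group-level terms in the paper's full objective are not shown, and `two_speed_contrastive` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def two_speed_contrastive(z_slow, z_fast, temperature=0.5):
    """InfoNCE between embeddings of a slow and a fast clip of the same
    video: matching videos are positives (the diagonal), other videos in
    the batch serve as negatives."""
    z_slow = F.normalize(z_slow, dim=-1)
    z_fast = F.normalize(z_fast, dim=-1)
    logits = z_slow @ z_fast.t() / temperature          # (B, B) similarities
    labels = torch.arange(z_slow.shape[0], device=z_slow.device)
    return F.cross_entropy(logits, labels)
```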
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning, which seeks to capture both motion and appearance features from unlabeled video alone.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
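The relative-speed pretext described in the entry above can be sketched as drawing two clips from one video at different frame strides and labeling the pair by which plays back faster. The strides, clip length, three-way label, and the function name `sample_relative_speed_pair` are illustrative assumptions, not the paper's exact protocol.

```python
import random
import torch

def sample_relative_speed_pair(video, clip_len=8, strides=(1, 2, 4)):
    """video: (T, C, H, W), assumed long enough for the largest stride.
    Returns two clips subsampled at random strides plus a relative-speed
    label: 0 = first is slower, 1 = same speed, 2 = first is faster."""
    s1, s2 = random.choice(strides), random.choice(strides)

    def clip_at(stride):
        start = random.randrange(video.shape[0] - stride * clip_len + 1)
        return video[start : start + stride * clip_len : stride]

    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)
    return clip_at(s1), clip_at(s2), label
```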
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.