Motion-Augmented Self-Training for Video Recognition at Smaller Scale
- URL: http://arxiv.org/abs/2105.01646v1
- Date: Tue, 4 May 2021 17:43:19 GMT
- Title: Motion-Augmented Self-Training for Video Recognition at Smaller Scale
- Authors: Kirill Gavrilyuk, Mihir Jain, Ilia Karmanov, Cees G. M. Snoek
- Abstract summary: We propose the first motion-augmented self-training regime, which we call MotionFit.
We generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model.
We obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval.
- Score: 32.73585552425734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this paper is to self-train a 3D convolutional neural network on
an unlabeled video collection for deployment on small-scale video collections.
As smaller video datasets benefit more from motion than appearance, we strive
to train our network using optical flow, but avoid its computation during
inference. We propose the first motion-augmented self-training regime, which we
call MotionFit. We start with supervised training of a motion model on a small,
labeled video collection. With the motion model we generate pseudo-labels for
a large unlabeled video collection, which enables us to transfer knowledge by
learning to predict these pseudo-labels with an appearance model. Moreover, we
introduce a multi-clip loss as a simple yet efficient way to improve the
quality of the pseudo-labeling, even without additional auxiliary tasks. We
also take into consideration the temporal granularity of videos during
self-training of the appearance model, which previous works have overlooked. As a
result we obtain a strong motion-augmented representation model suited for
video downstream tasks like action recognition and clip retrieval. On
small-scale video datasets, MotionFit outperforms alternatives for knowledge
transfer by 5%-8%, video-only self-supervision by 1%-7% and semi-supervised
learning by 9%-18% using the same amount of class labels.
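To make the knowledge-transfer step concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: a flow-trained motion model produces pseudo-labels for unlabeled videos, and an RGB appearance model is trained to predict them, so optical flow is never needed at inference time. The function names, the clip-sampling setup, and the exact form of the multi-clip loss (here, averaging predictions over several clips of the same video before matching the pseudo-label) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of motion-to-appearance pseudo-label self-training (PyTorch).
# Model definitions, the data loader, and the exact multi-clip loss are
# illustrative assumptions; they are not taken from the MotionFit paper.
import torch
import torch.nn.functional as F


@torch.no_grad()
def generate_pseudo_labels(motion_model, flow_clips):
    """Average softmax predictions of a flow-trained 3D CNN over several
    clips from the same unlabeled video to form a soft pseudo-label."""
    probs = torch.stack([F.softmax(motion_model(c), dim=-1) for c in flow_clips])
    return probs.mean(dim=0)  # (batch, num_classes)


def multi_clip_loss(appearance_model, rgb_clips, pseudo_label):
    """One possible multi-clip objective: aggregate the appearance model's
    logits over multiple RGB clips of the same video, then match the
    video-level pseudo-label with a soft cross-entropy."""
    logits = torch.stack([appearance_model(c) for c in rgb_clips]).mean(dim=0)
    return -(pseudo_label * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


def self_train(appearance_model, motion_model, unlabeled_loader, optimizer):
    """Transfer knowledge from the flow model to the RGB model; only the
    RGB model is kept for downstream tasks, so flow is not computed at test time."""
    appearance_model.train()
    motion_model.eval()
    for flow_clips, rgb_clips in unlabeled_loader:  # lists of clip tensors per video
        pseudo = generate_pseudo_labels(motion_model, flow_clips)
        loss = multi_clip_loss(appearance_model, rgb_clips, pseudo)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```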
Related papers
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving 50.7% AP^video_50.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS 2019 in terms of AP^video.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization [23.245275661852446]
We propose a self-supervised method for learning motion-focused video representations.
We learn similarities between videos with identical local motion dynamics but an otherwise different appearance.
Our approach maintains performance when using only 25% of the pretraining videos.
arXiv Detail & Related papers (2023-03-20T10:31:35Z)
- Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, an efficient framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting [107.39743751292028]
TransMoMo is capable of realistically transferring the motion of a person in a source video to a video of a target person.
We exploit invariance properties of three factors of variation: motion, structure, and view-angle.
We demonstrate the effectiveness of our method over the state-of-the-art methods.
arXiv Detail & Related papers (2020-03-31T17:49:53Z)