Auxiliary Learning for Self-Supervised Video Representation via
Similarity-based Knowledge Distillation
- URL: http://arxiv.org/abs/2112.04011v1
- Date: Tue, 7 Dec 2021 21:50:40 GMT
- Title: Auxiliary Learning for Self-Supervised Video Representation via
Similarity-based Knowledge Distillation
- Authors: Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
- Abstract summary: We propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD.
Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data.
We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations.
- Score: 2.6519061087638014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the outstanding success of self-supervised pretraining methods for
video representation learning, they generalise poorly when the unlabelled
dataset for pretraining is small or when the domain difference between the
unlabelled data in the source task (pretraining) and the labelled data in the
target task (finetuning) is significant. To mitigate these issues, we propose a novel approach to
complement self-supervised pretraining via an auxiliary pretraining phase,
based on knowledge similarity distillation, auxSKD, for better generalisation
with a significantly smaller amount of video data, e.g. Kinetics-100 rather
than Kinetics-400. Our method deploys a teacher network that iteratively
distils its knowledge to the student model by capturing the similarity
information between segments of unlabelled video data. The student model then
solves a pretext task by exploiting this prior knowledge. We also introduce a
novel pretext task, Video Segment Pace Prediction or VSPP, which requires our
model to predict the playback speed of a randomly selected segment of the input
video to provide more reliable self-supervised representations. Our
experiments show results superior to the state of the art on both the
UCF101 and HMDB51 datasets when pretraining on K100. Additionally, we show that
our auxiliary pretraining, auxSKD, when added as an extra pretraining phase to
recent state-of-the-art self-supervised methods (e.g. VideoPace and RSPNet),
improves their results on UCF101 and HMDB51. Our code will be released soon.
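As a rough illustration of the two ingredients described above, the sketch below shows one plausible PyTorch formulation of (i) a similarity-based knowledge distillation loss, where the student is trained to match the teacher's pairwise similarity distribution over video segments, and (ii) building a VSPP-style training sample by changing the playback speed of one randomly chosen segment. The function names, the temperature, and the candidate speed set are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only (PyTorch). Names and hyper-parameters are
# hypothetical; they are not taken from the auxSKD/VSPP code release.
import torch
import torch.nn.functional as F

def similarity_distillation_loss(teacher_feats, student_feats, temperature=0.1):
    """Generic similarity-based KD: align the student's pairwise segment
    similarities with the teacher's.

    teacher_feats, student_feats: (N, D) embeddings of N video segments.
    """
    t = F.normalize(teacher_feats, dim=1)
    s = F.normalize(student_feats, dim=1)

    # Pairwise cosine similarities between segments, excluding self-pairs.
    n = t.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=t.device)
    t_sim = ((t @ t.t()) / temperature)[mask].view(n, n - 1)
    s_sim = ((s @ s.t()) / temperature)[mask].view(n, n - 1)

    # Student matches the teacher's similarity distribution (KL divergence).
    return F.kl_div(F.log_softmax(s_sim, dim=1),
                    F.softmax(t_sim, dim=1),
                    reduction="batchmean")

def make_vspp_sample(frames, num_segments=4, speeds=(1, 2, 4)):
    """Toy VSPP-style sample: speed up one randomly chosen segment and keep
    its index and speed as the prediction targets.

    frames: (T, C, H, W) tensor of decoded video frames.
    """
    seg_len = frames.size(0) // num_segments
    segments = list(frames.split(seg_len)[:num_segments])
    seg_idx = torch.randint(num_segments, (1,)).item()
    speed_idx = torch.randint(len(speeds), (1,)).item()
    segments[seg_idx] = segments[seg_idx][::speeds[speed_idx]]
    clip = torch.cat(segments, dim=0)
    return clip, seg_idx, speed_idx
```

In a setup of this kind, the distillation loss would be used during the auxiliary pretraining phase, while the (segment index, speed index) targets would supervise the pretext-task heads in the main self-supervised stage.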
Related papers
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training [0.0]
We propose a novel learning approach based on contrastive learning, LAVA, which is capable of learning joint language, audio, and video representations in a self-supervised manner.
We demonstrate that LAVA performs competitively with the current state-of-the-art self-supervised and weakly-supervised pretraining techniques.
arXiv Detail & Related papers (2022-07-16T21:46:16Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Hierarchical Self-supervised Representation Learning for Movie Understanding [24.952866206036536]
We propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model.
Specifically, we propose to pretrain the low-level video backbone with a contrastive learning objective, while pretraining the higher-level video contextualizer with an event mask prediction task.
We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr score).
arXiv Detail & Related papers (2022-04-06T21:28:41Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
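The instance-discrimination contrastive setup mentioned here is typically implemented as an InfoNCE loss over clip embeddings. The following is a minimal, generic sketch of that idea (not ASCNet's actual code); the function name, batch layout, and temperature are assumptions for illustration.

```python
# Generic InfoNCE over video-clip embeddings (illustrative, PyTorch).
import torch
import torch.nn.functional as F

def clip_infonce_loss(z1, z2, temperature=0.07):
    """z1, z2: (N, D) embeddings of two augmented clips per video,
    so row i of z1 and row i of z2 form a positive pair."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; every other clip in the batch is a negative.
    return F.cross_entropy(logits, targets)
```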
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
- Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts [89.06560404218028]
We introduce a new method for pre-training video action recognition models using queried web videos.
Instead of trying to filter out noisy videos, we propose to convert the potential noise in these queried videos into useful supervision signals.
We show that SPL outperforms several existing pre-training strategies using pseudo-labels.
arXiv Detail & Related papers (2021-01-11T05:50:16Z)
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [39.78914328623504]
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage and supervised contrastive learning as auxiliary learning in the main stage.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
arXiv Detail & Related papers (2020-09-17T03:37:37Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We further propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both standard video dataset (Kinetics-210k) and uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences arising from its use.