Transfer of Representations to Video Label Propagation: Implementation Factors Matter
- URL: http://arxiv.org/abs/2203.05553v1
- Date: Thu, 10 Mar 2022 18:58:22 GMT
- Title: Transfer of Representations to Video Label Propagation: Implementation Factors Matter
- Authors: Daniel McKee, Zitong Zhan, Bing Shuai, Davide Modolo, Joseph Tighe, Svetlana Lazebnik
- Abstract summary: We study the impact of important implementation factors in feature extraction and label propagation.
We show that augmenting video-based correspondence cues with still-image-based ones can further improve performance.
We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence.
- Score: 31.030799003595522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies feature representations for dense label propagation in
video, with a focus on recently proposed methods that learn video
correspondence using self-supervised signals such as colorization or temporal
cycle consistency. In the literature, these methods have been evaluated with an
array of inconsistent settings, making it difficult to discern trends or
compare performance fairly. Starting with a unified formulation of the label
propagation algorithm that encompasses most existing variations, we
systematically study the impact of important implementation factors in feature
extraction and label propagation. Along the way, we report the accuracies of
properly tuned supervised and unsupervised still image baselines, which are
higher than those found in previous works. We also demonstrate that augmenting
video-based correspondence cues with still-image-based ones can further improve
performance. We then attempt a fair comparison of recent video-based methods on
the DAVIS benchmark, showing convergence of best methods to performance levels
near our strong ImageNet baseline, despite the usage of a variety of
specialized video-based losses and training particulars. Additional comparisons
on JHMDB and VIP datasets confirm the similar performance of current methods.
We hope that this study will help to improve evaluation practices and better
inform future research directions in temporal correspondence.
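To make the setting concrete, below is a minimal sketch of the affinity-based label propagation that such methods perform at evaluation time: labels from reference frames are carried over to a target frame via a temperature-scaled, top-k softmax over feature similarities. The function name, the temperature and top-k values, and the omission of a spatial-radius restriction are illustrative assumptions, not the paper's exact unified formulation.

```python
# Minimal sketch of dense label propagation between video frames, assuming
# per-frame feature maps have already been extracted (e.g., by a ResNet or a
# self-supervised correspondence network). Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_labels, tgt_feats, topk=10, temperature=0.07):
    """Propagate per-pixel labels from reference frames to a target frame.

    ref_feats:  (N, C, H, W) feature maps of N reference/context frames
    ref_labels: (N, K, H, W) soft label maps (K classes) for those frames
    tgt_feats:  (C, H, W)    feature map of the target frame
    Returns:    (K, H, W)    propagated soft labels for the target frame
    """
    N, C, H, W = ref_feats.shape
    K = ref_labels.shape[1]

    # Flatten spatial dimensions and L2-normalize features so the dot
    # product is a cosine similarity.
    ref = F.normalize(ref_feats.permute(1, 0, 2, 3).reshape(C, -1), dim=0)  # (C, N*H*W)
    tgt = F.normalize(tgt_feats.reshape(C, -1), dim=0)                      # (C, H*W)
    labels = ref_labels.permute(1, 0, 2, 3).reshape(K, -1)                  # (K, N*H*W)

    # Affinity between every target location and every reference location.
    affinity = tgt.t() @ ref                                                # (H*W, N*H*W)

    # Keep only the top-k most similar reference locations per target
    # location, then normalize the kept weights with a softmax.
    vals, idx = affinity.topk(topk, dim=1)
    weights = F.softmax(vals / temperature, dim=1)                          # (H*W, topk)

    # Propagate labels as a weighted combination over the selected locations.
    gathered = labels[:, idx]                                               # (K, H*W, topk)
    out = (gathered * weights.unsqueeze(0)).sum(dim=2)                      # (K, H*W)
    return out.reshape(K, H, W)
```

In benchmarks such as DAVIS, the reference set typically consists of the first frame with its ground-truth mask plus a few recent frames with their predicted masks, so the same routine is applied frame by frame along the video.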
Related papers
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Weakly Supervised Video Salient Object Detection [79.51227350937721]
We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations".
An "Appearance-motion fusion module" and a bidirectional ConvLSTM-based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling.
arXiv Detail & Related papers (2021-04-06T09:48:38Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views of the same instance, with other data instances serving as negatives (a generic sketch of this setup is given after this list).
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation training.
We propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss.
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because video representation is important, we extend negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition [15.701647552427708]
We present a novel way to obtain the surrogate supervision signal based on high-level feature maps under consistency regularization.
Our method achieves substantial improvements compared with state-of-the-art self-supervised learning methods for action recognition.
arXiv Detail & Related papers (2020-08-05T12:41:59Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
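As referenced above, a generic instance-discrimination contrastive objective of the kind used by several of these related papers can be sketched as follows. This is a plain InfoNCE illustration, not the loss of any particular paper listed here; the encoder, batch construction, and temperature are assumptions for illustration.

```python
# Generic InfoNCE loss over two augmented views of the same batch of clips:
# positives are the two views of a clip, negatives are all other clips in
# the batch. Assumes embeddings have already been produced by some encoder.
import torch
import torch.nn.functional as F

def info_nce_loss(view_a, view_b, temperature=0.1):
    """view_a, view_b: (B, D) embeddings of two views of the same B clips."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)

    # Cosine similarity between every view-a embedding and every view-b
    # embedding; the diagonal holds the positive pairs.
    logits = a @ b.t() / temperature            # (B, B)
    targets = torch.arange(a.size(0), device=a.device)

    # Cross-entropy pulls each clip toward its own second view and pushes it
    # away from the other clips in the batch.
    return F.cross_entropy(logits, targets)
```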