Rethinking Self-supervised Correspondence Learning: A Video Frame-level
Similarity Perspective
- URL: http://arxiv.org/abs/2103.17263v2
- Date: Thu, 1 Apr 2021 18:15:49 GMT
- Title: Rethinking Self-supervised Correspondence Learning: A Video Frame-level
Similarity Perspective
- Authors: Jiarui Xu, Xiaolong Wang
- Abstract summary: We propose to learn correspondence using Video Frame-level Similarity (VFS) learning.
Our work is inspired by the recent success in image-level contrastive learning and similarity learning for visual recognition.
Our experiments show surprising results that VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation.
- Score: 13.90183404059193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning a good representation for space-time correspondence is the key for
various computer vision tasks, including tracking object bounding boxes and
performing video object pixel segmentation. To learn generalizable
representations for correspondence at large scale, a variety of self-supervised
pretext tasks have been proposed to explicitly perform object-level or
patch-level similarity learning. Instead of following the previous literature,
we propose to learn correspondence using Video Frame-level Similarity (VFS) learning, i.e.,
simply learning from comparing video frames. Our work is inspired by the recent
success in image-level contrastive learning and similarity learning for visual
recognition. Our hypothesis is that if the representation is good for
recognition, it requires the convolutional features to find correspondence
between similar objects or parts. Our experiments show surprising results that
VFS surpasses state-of-the-art self-supervised approaches for both OTB visual
object tracking and DAVIS video object segmentation. We perform detailed
analysis on what matters in VFS and reveal new properties of image- and
frame-level similarity learning. The project page is available at
https://jerryxu.net/VFS.
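As a rough, hypothetical illustration of the abstract's two ideas, the sketch below (PyTorch) shows (1) an InfoNCE-style frame-level similarity loss between pairs of frames drawn from the same videos, and (2) how convolutional feature maps could be read out as dense correspondences through a feature-affinity matrix. The encoder choice, projection size, temperature, and exact loss form are assumptions made for illustration, not the paper's actual configuration.

# Minimal, hypothetical sketch of frame-level similarity learning and a
# correspondence readout. Nothing here is the authors' exact recipe.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def frame_similarity_loss(encoder, frames_a, frames_b, temperature=0.07):
    """Compare frame pairs sampled from the same videos (positives on the diagonal)."""
    za = F.normalize(encoder(frames_a), dim=1)   # (B, D) frame embeddings
    zb = F.normalize(encoder(frames_b), dim=1)   # (B, D)
    logits = za @ zb.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, targets)


def correspondence_affinity(feat_src, feat_tgt):
    """Dense affinity between convolutional feature maps of two frames.

    feat_src, feat_tgt: (C, H, W). Returns an (H*W, H*W) matrix whose per-row
    argmax gives, for each source location, its best match in the target frame.
    """
    fs = F.normalize(feat_src.flatten(1), dim=0)  # (C, H*W), unit-norm per location
    ft = F.normalize(feat_tgt.flatten(1), dim=0)
    return fs.t() @ ft


if __name__ == "__main__":
    encoder = resnet50(num_classes=128)  # projection folded into the final fc layer
    a, b = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
    frame_similarity_loss(encoder, a, b).backward()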
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Pixel-level Correspondence for Self-Supervised Learning from Video [56.24439897867531]
Pixel-level Correspondence (PiCo) is a method for dense contrastive learning from video.
We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks.
arXiv Detail & Related papers (2022-07-08T12:50:13Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency [13.19476138523546]
Cross-video relation has barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [67.17625278621134]
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video.
Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects.
We propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information.
arXiv Detail & Related papers (2020-12-07T00:31:42Z)