Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning
- URL: http://arxiv.org/abs/2203.14333v2
- Date: Tue, 29 Mar 2022 04:35:54 GMT
- Title: Locality-Aware Inter- and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning
- Authors: Liulei Li, Tianfei Zhou, Wenguan Wang, Lu Yang, Jianwu Li, Yi Yang
- Abstract summary: We develop LIIR, a locality-aware inter- and intra-video reconstruction framework.
We exploit cross video affinities as extra negative samples within a unified, inter-and intra-video reconstruction scheme.
- Score: 74.03651142051656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our target is to learn visual correspondence from unlabeled videos. We
develop LIIR, a locality-aware inter- and intra-video reconstruction framework
that fills in three missing pieces, i.e., instance discrimination, location
awareness, and spatial compactness, of the self-supervised correspondence learning
puzzle. First, unlike most existing efforts that focus on intra-video
self-supervision only, we exploit cross-video affinities as extra negative
samples within a unified, inter- and intra-video reconstruction scheme. This
enables instance-discriminative representation learning by contrasting desired
intra-video pixel associations against negative inter-video correspondences.
Second, we merge position information into correspondence matching and design
a position shifting strategy to remove the side effect of position encoding
during inter-video affinity computation, making LIIR location-sensitive.
Third, to make full use of the spatially continuous nature of video data, we
impose a compactness-based constraint on correspondence matching, yielding
sparser and more reliable solutions. The learned representation surpasses
self-supervised state-of-the-art methods on label propagation tasks covering
objects, semantic parts, and keypoints.
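The abstract only sketches these three components at a high level. Below is a minimal, illustrative sketch of the core idea: reconstructing a query frame from an intra-video reference while features from other videos act as extra negatives inside the affinity softmax. All function names, tensor shapes, the temperature value, and the top-k locality constraint are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of affinity-based frame reconstruction with inter-video
# negatives, in the spirit of the LIIR abstract. Names, shapes, and the
# top-k locality constraint are illustrative assumptions only.
import torch
import torch.nn.functional as F

def reconstruct_query_frame(query, ref_intra, refs_inter, temperature=0.07, topk=16):
    """
    query      : (N, C) features of the query frame (N = H*W pixels)
    ref_intra  : (M, C) features of a reference frame from the SAME video
    refs_inter : (K, C) features pooled from OTHER videos (extra negatives)
    Returns a sparse reconstruction affinity over the intra-video reference.
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(torch.cat([ref_intra, refs_inter], dim=0), dim=-1)  # (M+K, C)

    # Affinity between every query pixel and every key pixel.
    logits = query @ keys.t() / temperature                                # (N, M+K)

    # Softmax over intra- AND inter-video keys: inter-video pixels act as
    # negatives, so probability mass assigned to them is "wasted" and the
    # reconstruction loss pushes the model toward instance-discriminative,
    # intra-video matches.
    affinity = logits.softmax(dim=-1)[:, :ref_intra.size(0)]               # (N, M)

    # Compactness / locality (illustrative stand-in): keep only the top-k
    # intra-video matches per query pixel, yielding a sparser, more
    # reliable affinity.
    k = min(topk, ref_intra.size(0))
    vals, idx = affinity.topk(k, dim=-1)
    sparse_affinity = torch.zeros_like(affinity).scatter_(-1, idx, vals)
    sparse_affinity = sparse_affinity / sparse_affinity.sum(-1, keepdim=True).clamp_min(1e-8)
    return sparse_affinity
```

In a full pipeline, the returned affinity would reconstruct the query frame's colors (or propagate labels at test time) from the intra-video reference, with the loss penalizing any mass leaked to the inter-video negatives; the position-shifting strategy mentioned in the abstract, omitted here, would control how positional information enters the inter-video affinity computation.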
Related papers
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs; a toy version of such a summary target is sketched after this list.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
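As a companion to the last entry above, here is a hedged, self-contained sketch of what a spatio-temporal statistical summary target could look like: the grid cell containing the largest motion and a coarse dominant-direction bin, computed from two consecutive frames. The frame-difference proxy, the 4x4 grid, and the 8-way direction binning are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a "statistical summary" pretext target: locate the grid
# cell with the largest motion and quantize a crude motion direction.
# The frame-difference proxy and 8-way binning are assumptions for
# illustration only.
import numpy as np

def motion_summary(frame_t, frame_t1, grid=4):
    """frame_t, frame_t1: (H, W) grayscale frames as float arrays."""
    H, W = frame_t.shape
    diff = np.abs(frame_t1 - frame_t)

    # Spatial location of the largest motion: grid cell with max mean difference.
    cell_h, cell_w = H // grid, W // grid
    energy = diff[:grid * cell_h, :grid * cell_w]
    energy = energy.reshape(grid, cell_h, grid, cell_w).mean(axis=(1, 3))
    loc = int(np.argmax(energy))                      # cell index in [0, grid*grid)

    # Dominant direction: angle of the summed gradient of the difference image,
    # quantized into 8 bins (a crude stand-in for an optical-flow estimate).
    gy, gx = np.gradient(diff)
    angle = np.arctan2(gy.sum(), gx.sum())
    direction = int((angle + np.pi) / (2 * np.pi) * 8) % 8
    return loc, direction                             # targets for the pretext network
```

A pretext network would then be trained to predict (loc, direction) from the raw frames, encouraging it to encode where and how things move.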