Learning Fine-Grained Features for Pixel-wise Video Correspondences
- URL: http://arxiv.org/abs/2308.03040v1
- Date: Sun, 6 Aug 2023 07:27:17 GMT
- Title: Learning Fine-Grained Features for Pixel-wise Video Correspondences
- Authors: Rui Li, Shenglong Zhou, Dong Liu
- Abstract summary: We address the problem of learning features for establishing pixel-wise correspondences.
Motivated by optical flow as well as self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos.
Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency.
- Score: 13.456993858078514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video analysis tasks rely heavily on identifying the pixels from different
frames that correspond to the same visual target. To tackle this problem,
recent studies have advocated feature learning methods that aim to learn
distinctive representations to match the pixels, especially in a
self-supervised fashion. Unfortunately, these methods struggle with tiny or
even single-pixel visual targets. Pixel-wise video correspondences have
traditionally been derived from optical flow, which, however, yields
deterministic correspondences and lacks robustness on real-world videos. We
address the problem of learning features for establishing pixel-wise
correspondences. Motivated by optical flow as well as self-supervised feature
learning, we
propose to use not only labeled synthetic videos but also unlabeled real-world
videos for learning fine-grained representations in a holistic framework. We
adopt an adversarial learning scheme to enhance the generalization ability of
the learned features. Moreover, we design a coarse-to-fine framework to pursue
high computational efficiency. Our experimental results on a series of
correspondence-based tasks demonstrate that the proposed method outperforms
state-of-the-art rivals in both accuracy and efficiency.
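As a rough illustration of the coarse-to-fine matching described above, the sketch below correlates a query pixel's learned feature against a strided grid of the second frame and then refines the match within a local window. The backbone producing the features, as well as the stride and radius values, are assumptions rather than the paper's actual design.

```python
import torch

def coarse_to_fine_match(feat1, feat2, y, x, stride=8, radius=4):
    """Match pixel (y, x) of frame 1 to a pixel of frame 2.

    feat1, feat2: (C, H, W) L2-normalized per-pixel feature maps.
    """
    C, H, W = feat2.shape
    query = feat1[:, y, x]                            # (C,) query feature
    # Coarse stage: correlate the query against a strided grid of frame 2.
    coarse = feat2[:, ::stride, ::stride]             # (C, H/stride, W/stride)
    sim = torch.einsum('c,chw->hw', query, coarse)    # cosine similarities
    idx = int(torch.argmax(sim))
    cy = (idx // coarse.shape[2]) * stride            # coarse row estimate
    cx = (idx % coarse.shape[2]) * stride             # coarse column estimate
    # Fine stage: dense correlation in a local window around the coarse hit.
    y0, y1 = max(cy - radius, 0), min(cy + radius + 1, H)
    x0, x1 = max(cx - radius, 0), min(cx + radius + 1, W)
    window = feat2[:, y0:y1, x0:x1]
    sim = torch.einsum('c,chw->hw', query, window)
    idx = int(torch.argmax(sim))
    return y0 + idx // window.shape[2], x0 + idx % window.shape[2]
```

Restricting the fine search to a small window is what keeps the per-pixel cost low compared with a dense full-resolution correlation.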
Related papers
- Aligning Motion-Blurred Images Using Contrastive Learning on Overcomplete Pixels [1.8810643529425775]
We propose a new contrastive objective for learning overcomplete pixel-level features that are invariant to motion blur.
We showcase that a simple U-Net trained with our objective can produce local features useful for aligning the frames of an unseen video captured with a moving camera under realistic and challenging conditions.
arXiv Detail & Related papers (2024-10-09T20:21:43Z)
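A minimal sketch of the kind of pixel-level contrastive (InfoNCE-style) objective described above, treating features of the same pixel under sharp and blurred views as positives and all other pixels as negatives; the exact objective and the view pairing are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def pixel_info_nce(feat_sharp, feat_blur, temperature=0.07):
    """feat_sharp, feat_blur: (N, C) features of the same N pixels in two views."""
    z1 = F.normalize(feat_sharp, dim=1)
    z2 = F.normalize(feat_blur, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) pairwise similarities
    targets = torch.arange(z1.shape[0])       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```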
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
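The three criteria above can be read directly in feature space; the sketch below gives one hedged interpretation, with illustrative definitions rather than the paper's exact metrics.

```python
import torch
import torch.nn.functional as F

def keyframe_scores(feats):
    """feats: (T, C) per-frame features; returns a (T,) importance score."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t()                                         # (T, T) similarities
    T = sim.shape[0]
    # Local dissimilarity: a key frame differs from its temporal neighbors.
    prev = torch.cat([sim.new_ones(1), sim.diagonal(-1)])   # sim(t, t-1), padded
    nxt = torch.cat([sim.diagonal(1), sim.new_ones(1)])     # sim(t, t+1), padded
    local_dissimilarity = 1 - 0.5 * (prev + nxt)
    # Global consistency: a key frame is representative of the whole video.
    global_consistency = sim.mean(dim=1)
    # Uniqueness: a key frame is not near-duplicated elsewhere in the video.
    uniqueness = 1 - (sim - torch.eye(T)).max(dim=1).values
    return local_dissimilarity + global_consistency + uniqueness
```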
- Pixel-level Correspondence for Self-Supervised Learning from Video [56.24439897867531]
Pixel-level Correspondence (PiCo) is a method for dense contrastive learning from video.
We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks.
arXiv Detail & Related papers (2022-07-08T12:50:13Z)
- Learning Pixel-Level Distinctions for Video Highlight Detection [39.23271866827123]
We propose to learn pixel-level distinctions to improve the video highlight detection.
This pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section.
We design an encoder-decoder network to estimate the pixel-level distinction.
arXiv Detail & Related papers (2022-04-10T06:41:16Z)
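A minimal encoder-decoder sketch of a network of this kind, mapping frames to a per-pixel distinction score in [0, 1]; the layer sizes and overall architecture are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class PixelDistinctionNet(nn.Module):
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        # Encoder: two strided convolutions downsample by 4x in total.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions restore the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(width, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames):                      # frames: (B, 3, H, W)
        return self.decoder(self.encoder(frames))   # (B, 1, H, W) in [0, 1]
```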
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging the more abundant information in untrimmed videos.
HiCo generates stronger representations on untrimmed videos and also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [89.17394772676819]
We introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens.
Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
arXiv Detail & Related papers (2021-06-21T17:55:59Z)
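One common reading of such adaptively learned tokens is spatial-attention pooling: each token is an attention-weighted average of the feature map. The sketch below follows that reading, with the gating and normalization details simplified.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    def __init__(self, channels, num_tokens=8):
        super().__init__()
        # One 1x1 convolution per token predicts a spatial attention map.
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x):                               # x: (B, C, H, W)
        a = self.attn(x).flatten(2).softmax(dim=-1)     # (B, K, H*W) attention
        v = x.flatten(2)                                # (B, C, H*W) values
        return torch.einsum('bkn,bcn->bkc', a, v)       # (B, K, C) tokens
```

Reducing a full feature map to a few such tokens is what makes the downstream attention layers cheap for both images and videos.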
- Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency [13.19476138523546]
Cross-video relations have barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z)
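A hedged sketch of the cycle-consistency idea: features are matched from one video to another and back, and only matches that return to their starting point are kept. The hard argmax here stands in for the differentiable soft matching typically used for training.

```python
import torch
import torch.nn.functional as F

def cycle_consistent(feats_a, feats_b):
    """feats_a: (N, C), feats_b: (M, C); returns a (N,) bool mask."""
    za = F.normalize(feats_a, dim=1)
    zb = F.normalize(feats_b, dim=1)
    ab = (za @ zb.t()).argmax(dim=1)      # best match A -> B
    ba = (zb @ za.t()).argmax(dim=1)      # best match B -> A
    cycle = ba[ab]                        # round trip A -> B -> A
    return cycle == torch.arange(za.shape[0])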
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
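The flow-equivariance idea can be sketched as a warping consistency penalty: frame t's features should agree with frame t+1's features sampled at the flow-displaced locations. The grid_sample-based warping and MSE penalty below are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def flow_equivariance_loss(feat_t, feat_t1, flow):
    """feat_t, feat_t1: (B, C, H, W); flow: (B, 2, H, W), t -> t+1, in pixels."""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    # Absolute target coordinates in frame t+1, normalized to [-1, 1].
    gx = 2 * (xs + flow[:, 0]) / (W - 1) - 1
    gy = 2 * (ys + flow[:, 1]) / (H - 1) - 1
    grid = torch.stack([gx, gy], dim=-1)          # (B, H, W, 2) for grid_sample
    warped = F.grid_sample(feat_t1, grid, align_corners=True)
    return F.mse_loss(warped, feat_t)
```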
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)