Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning
- URL: http://arxiv.org/abs/2503.12026v2
- Date: Wed, 30 Apr 2025 14:58:56 GMT
- Title: Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning
- Authors: Zihan Zhou, Changrui Dai, Aibo Song, Xiaolin Fang
- Abstract summary: We develop an efficient self-supervised Video Correspondence Learning framework. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information.
- Score: 5.372301053935416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self-supervised settings. To this end, we explore an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. Through experiments, our algorithm outperforms the state-of-the-art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.
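The abstract does not spell out implementation details, but the core operation it relies on, associating pixels across frames through a feature-similarity (affinity) matrix and propagating labels through it, is common to most self-supervised correspondence methods. The sketch below illustrates that generic mechanism in PyTorch; the function name, the top-k restriction, and the temperature value are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07, topk=10):
    """Sketch of affinity-based label propagation for video correspondence.

    feat_ref, feat_tgt: (C, H, W) frame features from a learned encoder
    labels_ref:         (K, H, W) per-pixel label distribution (e.g. one-hot masks)
    Returns:            (K, H, W) labels propagated to the target frame
    """
    C, H, W = feat_ref.shape
    # Flatten spatial dims and L2-normalize so dot products are cosine similarities.
    f_ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, HW)
    f_tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, HW)
    affinity = f_tgt.t() @ f_ref                          # (HW_tgt, HW_ref)
    # Keep only the top-k reference pixels per target pixel; this is one common
    # way to suppress the false matches the abstract mentions.
    vals, idx = affinity.topk(topk, dim=1)                # (HW_tgt, k)
    weights = F.softmax(vals / temperature, dim=1)
    lab = labels_ref.reshape(labels_ref.shape[0], -1)     # (K, HW_ref)
    # Weighted sum of the labels of the selected reference pixels.
    out = (lab[:, idx] * weights.unsqueeze(0)).sum(dim=2) # (K, HW_tgt)
    return out.reshape(-1, H, W)
```

In this framing, the paper's Motion Enhancement Engine and Multi-Cluster Sampler would shape which features and pixel pairs enter the matching; their actual designs are described in the paper itself.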
Related papers
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames. This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
- Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training [13.985488693082981]
We propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks.
We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-22T07:27:09Z)
- Learning Fine-Grained Features for Pixel-wise Video Correspondences [13.456993858078514]
We address the problem of learning features for establishing pixel-wise correspondences.
Motivated by optical flows as well as the self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos.
Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency.
arXiv Detail & Related papers (2023-08-06T07:27:17Z)
- Pixel-level Correspondence for Self-Supervised Learning from Video [56.24439897867531]
Pixel-level Correspondence (PiCo) is a method for dense contrastive learning from video.
We validate PiCo on standard benchmarks, outperforming self-supervised baselines on multiple dense prediction tasks.
arXiv Detail & Related papers (2022-07-08T12:50:13Z)
- Learning Pixel-Level Distinctions for Video Highlight Detection [39.23271866827123]
We propose to learn pixel-level distinctions to improve the video highlight detection.
This pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section.
We design an encoder-decoder network to estimate the pixel-level distinction.
arXiv Detail & Related papers (2022-04-10T06:41:16Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
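Several entries above (PiCo and this contrastive-transformation work in particular) build on pixel-level contrastive learning. As a rough illustration of the intra-/inter-video association idea, the following is a standard InfoNCE loss over pixel embeddings; it is a hypothetical sketch, not either paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pixel_info_nce(anchor, positive, negatives, temperature=0.1):
    """Illustrative InfoNCE over pixel embeddings.

    anchor:    (N, C) pixel embeddings from frame t
    positive:  (N, C) matching pixels from frame t+1 (intra-video positives)
    negatives: (M, C) pixel embeddings from other videos (inter-video negatives)
    """
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)
    pos_logit = (anchor * positive).sum(dim=1, keepdim=True)  # (N, 1)
    neg_logits = anchor @ negatives.t()                       # (N, M)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    # The positive sits at column 0 of every row.
    target = torch.zeros(anchor.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```

Drawing negatives from other videos, rather than only from other locations in the same clip, is what gives the inter-video association its discriminative signal.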
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.