CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
- URL: http://arxiv.org/abs/2503.23925v1
- Date: Mon, 31 Mar 2025 10:17:01 GMT
- Title: CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
- Authors: Zizhuo Li, Yifan Lu, Linfeng Tang, Shihua Zhang, Jiayi Ma,
- Abstract summary: CoMatch is a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy.<n>A covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores.<n>A fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level.
- Score: 31.42896369011162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores that are dynamically estimated, thereby ensuring computational efficiency while improving the representational capacity of aggregated tokens simultaneously. Secondly, considering that feature interaction with massive non-covisible areas is distracting, which may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant message broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant rather than all ones. Thirdly, we find that at the fine-level stage, current methods adjust only the target view's keypoints to subpixel level, while those in the source view remain restricted at the coarse level and thus not informative enough, detrimental to keypoint location-sensitive usages. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level, attaining attractive performance improvement. Thorough experimentation across an array of public benchmarks affirms CoMatch's promising accuracy, efficiency, and generalizability.
Related papers
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.<n>Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis [8.386261591495103]
We introduce CoCoNO, a new algorithm that optimize the initial latent by leveraging the complementary information within self-attention and cross-attention maps.
Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented.
arXiv Detail & Related papers (2024-11-25T08:20:14Z) - Superpixel Cost Volume Excitation for Stereo Matching [27.757112234793624]
In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints.
Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels.
arXiv Detail & Related papers (2024-11-20T07:59:55Z) - Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z) - Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world.<n>Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation.<n>We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z) - DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition
with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++)
arXiv Detail & Related papers (2023-08-03T17:33:20Z) - Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural
Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z) - ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer [33.603064903549985]
ASpanFormer is a Transformer-based detector-free matcher that is built on hierarchical attention structure.
We propose a novel attention operation which is capable of adjusting attention span in a self-adaptive manner.
By these means, we are able to not only maintain long-range dependencies, but also enable fine-grained attention among pixels of high relevance.
arXiv Detail & Related papers (2022-08-30T12:21:15Z) - Deep Semantic Statistics Matching (D2SM) Denoising Network [70.01091467628068]
We introduce the Deep Semantic Statistics Matching (D2SM) Denoising Network.
It exploits semantic features of pretrained classification networks, then it implicitly matches the probabilistic distribution of clear images at the semantic feature space.
By learning to preserve the semantic distribution of denoised images, we empirically find our method significantly improves the denoising capabilities of networks.
arXiv Detail & Related papers (2022-07-19T14:35:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.