Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence
- URL: http://arxiv.org/abs/2403.11120v2
- Date: Mon, 22 Apr 2024 09:06:54 GMT
- Title: Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence
- Authors: Sunghwan Hong, Seokju Cho, Seungryong Kim, Stephen Lin
- Abstract summary: This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
- Score: 51.54175067684008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. In the context of dense matching, many works benefit from one of two forms of aggregation: feature aggregation, which pertains to the alignment of similar features, or cost aggregation, a procedure aimed at instilling coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. We then introduce a simple yet effective architecture that harnesses self- and cross-attention mechanisms to show that our approach unifies feature aggregation and cost aggregation and effectively exploits the strengths of both techniques. Within the proposed attention layers, the features and cost volume complement each other, and the attention layers are interleaved through a coarse-to-fine design to further promote accurate correspondence estimation. Finally, at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow as the final prediction. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
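The inference procedure described in the abstract (multi-scale flow predictions, per-prediction confidence scores, selection of the most confident flow) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name, the per-pixel selection rule, and the assumption that all predictions have already been upsampled to a common resolution are assumptions for the sake of the example.

```python
import numpy as np

def select_most_confident_flow(flows, confidences):
    """Pick, per pixel, the flow vector from whichever scale's
    prediction is most confident.

    flows: list of (H, W, 2) arrays, all upsampled to one resolution.
    confidences: list of (H, W) confidence maps, one per prediction.
    Returns an (H, W, 2) flow field.
    """
    flows = np.stack(flows)           # (S, H, W, 2)
    confs = np.stack(confidences)     # (S, H, W)
    best = np.argmax(confs, axis=0)   # (H, W): most confident scale per pixel
    h, w = best.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return flows[best, rows, cols]    # gather the winning flow per pixel
```

A global variant (selecting one whole prediction by its mean confidence) would be equally consistent with the abstract's wording; the per-pixel gather above is just one plausible reading.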
Related papers
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We also develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
Our proposed ReSFU framework consistently achieves satisfactory performance on different segmentation applications.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- One for all: A novel Dual-space Co-training baseline for Large-scale Multi-View Clustering [42.92751228313385]
We propose a novel multi-view clustering model, named Dual-space Co-training Large-scale Multi-view Clustering (DSCMC)
The main objective of our approach is to enhance the clustering performance by leveraging co-training in two distinct spaces.
Our algorithm has an approximate linear computational complexity, which guarantees its successful application on large-scale datasets.
arXiv Detail & Related papers (2024-01-28T16:30:13Z)
- Local Consensus Enhanced Siamese Network with Reciprocal Loss for Two-view Correspondence Learning [35.5851523517487]
Two-view correspondence learning usually establishes an end-to-end network to jointly predict correspondence reliability and relative pose.
We propose a Local Feature Consensus (LFC) plugin block to augment the features of existing models.
We extend existing models to a Siamese network with a reciprocal loss that exploits the supervision of mutual projection.
arXiv Detail & Related papers (2023-08-06T22:20:09Z)
- ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement [80.94378602238432]
We propose an efficient structure named Correspondence Efficient Transformer (ECO-TR) by finding correspondences in a coarse-to-fine manner.
To achieve this, multiple transformer blocks are stage-wisely connected to gradually refine the predicted coordinates.
Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness against existing state-of-the-arts.
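The stage-wise refinement described above, in which connected blocks gradually refine predicted coordinates, can be illustrated with a residual update loop. This is a hedged sketch, not ECO-TR's implementation: the function name is hypothetical, and plain callables stand in for the paper's transformer blocks.

```python
import numpy as np

def coarse_to_fine_refine(initial_coords, refine_stages):
    """Illustrative coarse-to-fine refinement: each stage predicts a
    correction that is added to the current coordinate estimate.

    initial_coords: array of coarse coordinate predictions.
    refine_stages: list of callables (stand-ins for transformer
    blocks), each mapping current coordinates to a correction.
    """
    coords = np.asarray(initial_coords, dtype=float)
    for stage in refine_stages:
        coords = coords + stage(coords)  # residual update per stage
    return coords
```

Each stage sees the output of the previous one, so errors left by a coarse stage can be reduced by the finer stages that follow.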
arXiv Detail & Related papers (2022-09-25T13:05:33Z)
- Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence [63.868905184847954]
The current state of the art consists of Transformer-based approaches that focus on either feature descriptors or cost volume aggregation.
We propose a novel Transformer-based network that interleaves both forms of aggregations in a way that exploits their complementary information.
We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks.
arXiv Detail & Related papers (2022-09-19T03:33:35Z)
- Consensual Aggregation on Random Projected High-dimensional Features for Regression [0.0]
We present a study of a kernel-based consensual aggregation on randomly projected high-dimensional features of predictions for regression.
We numerically illustrate that the aggregation scheme upholds its performance on very large and highly correlated features.
The efficiency of the proposed method is illustrated through several experiments evaluated on different types of synthetic and real datasets.
arXiv Detail & Related papers (2022-04-06T06:35:47Z)
- Mix and Mask Actor-Critic Methods [0.0]
Shared feature spaces for actor-critic methods aim to capture generalized latent representations for use by both the policy and the value function.
We present a novel feature-sharing framework to address these difficulties by introducing the mix and mask mechanisms and the distributional scalarization technique.
From our experimental results, we demonstrate significant performance improvements compared to alternative methods using separate networks and networks with a shared backbone.
arXiv Detail & Related papers (2021-06-24T14:12:45Z)
- Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
We present a novel framework that accomplishes this goal by interactive fusion.
We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction.
We evaluate our approach on two action recognition datasets, Something-Something and Charades.
arXiv Detail & Related papers (2020-12-10T14:17:18Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.