Integrative Feature and Cost Aggregation with Transformers for Dense
Correspondence
- URL: http://arxiv.org/abs/2209.08742v2
- Date: Tue, 20 Sep 2022 04:51:13 GMT
- Authors: Sunghwan Hong, Seokju Cho, Seungryong Kim, Stephen Lin
- Abstract summary: The current state-of-the-art approaches are Transformer-based and focus on either feature descriptors or cost volume aggregation.
We propose a novel Transformer-based network that interleaves both forms of aggregation in a way that exploits their complementary information.
We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel architecture for dense correspondence. The current
state-of-the-art approaches are Transformer-based and focus on either feature
descriptors or cost volume aggregation. However, they generally aggregate one
or the other but not both, even though joint aggregation would benefit each by
providing information that one has but the other lacks, i.e., the structural or
semantic information of an image, or the pixel-wise matching similarity. In this
work, we propose a novel Transformer-based network that interleaves both forms
of aggregation in a way that exploits their complementary information.
Specifically, we design a self-attention layer that leverages the descriptor to
disambiguate the noisy cost volume and that also utilizes the cost volume to
aggregate features in a manner that promotes accurate matching. A subsequent
cross-attention layer performs further aggregation conditioned on the
descriptors of both images and aided by the aggregated outputs of earlier
layers. We further boost the performance with hierarchical processing, in which
coarser level aggregations guide those at finer levels. We evaluate the
effectiveness of the proposed method on dense matching tasks and achieve
state-of-the-art performance on all the major benchmarks. Extensive ablation
studies are also provided to validate our design choices.
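The core idea of the abstract — a self-attention layer that treats each position's descriptor together with its row of the cost volume as one token, so semantic cues can disambiguate noisy costs while costs guide feature aggregation — can be illustrated with a minimal NumPy sketch. This is not the authors' architecture; all shapes, weight matrices, and the toy cost volume are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interleaved_self_attention(feat, cost, d_k=16, seed=0):
    """One hypothetical self-attention step over joint feature/cost tokens.

    feat: (N, D) descriptors for N source positions
    cost: (N, M) matching costs against M target positions
    Each token concatenates a descriptor with its cost-volume row, so one
    attention map refines both quantities (the "interleaving" idea in spirit).
    """
    tokens = np.concatenate([feat, cost], axis=1)          # (N, D+M)
    rng = np.random.default_rng(seed)                      # stand-in for learned weights
    Wq = rng.standard_normal((tokens.shape[1], d_k)) / np.sqrt(tokens.shape[1])
    Wk = rng.standard_normal((tokens.shape[1], d_k)) / np.sqrt(tokens.shape[1])
    attn = softmax((tokens @ Wq) @ (tokens @ Wk).T / np.sqrt(d_k))  # (N, N)
    # The same attention map aggregates descriptors and cost rows jointly.
    return attn @ feat, attn @ cost

# Toy example: 6 source positions, 8 target positions, 16-dim descriptors.
rng = np.random.default_rng(1)
feat = rng.standard_normal((6, 16))
cost = feat[:, :8] @ rng.standard_normal((8, 8))  # arbitrary toy cost volume
new_feat, new_cost = interleaved_self_attention(feat, cost)
print(new_feat.shape, new_cost.shape)  # (6, 16) (6, 8)
```

In the actual paper the attention weights are learned and a further cross-attention layer conditions on both images' descriptors; the sketch only shows why a shared attention map lets each representation inform the other.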
Related papers
- Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z) - Learning Image Deraining Transformer Network with Dynamic Dual
Self-Attention [46.11162082219387]
This paper proposes an effective image deraining Transformer with dynamic dual self-attention (DDSA).
Specifically, we only select the most useful similarity values based on top-k approximate calculation to achieve sparse attention.
In addition, we also develop a novel spatial-enhanced feed-forward network (SEFN) to further obtain a more accurate representation for achieving high-quality derained results.
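The top-k sparsification described above — keeping only the most useful similarity values per query before the softmax — can be sketched as follows. This is a generic illustration of top-k sparse attention, not the DDSA module itself; the shapes and the threshold-based masking are assumptions.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=2):
    """Keep only the top-k similarity scores per query row; mask the rest
    to -inf so they receive exactly zero attention weight after softmax."""
    scores = q @ k.T / np.sqrt(q.shape[1])                  # (Nq, Nk)
    thresh = np.sort(scores, axis=1)[:, -top_k][:, None]    # k-th largest per row
    masked = np.where(scores >= thresh, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
out, w = topk_sparse_attention(q, k, v, top_k=2)
print((w > 0).sum(axis=1))  # at most 2 nonzero weights per query row
```

The paper computes top-k approximately for efficiency; the exact sort here is just the simplest way to show the effect.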
arXiv Detail & Related papers (2023-08-15T13:59:47Z) - Measuring the Mixing of Contextual Information in the Transformer [0.19116784879310028]
We consider the whole attention block (multi-head attention, residual connection, and layer normalization) and define a metric to measure token-to-token interactions.
Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions.
Experimentally, we show that our method, ALTI, provides faithful explanations and outperforms similar aggregation methods.
arXiv Detail & Related papers (2022-03-08T17:21:27Z) - Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We plug this learned aggregation layer into a simple patch-based convolutional network parameterized by just two parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
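The attention-based global aggregation in this paper can be reduced to its essence: a single learned query vector that attention-pools all patch features into one global representation. The sketch below is a hypothetical minimal version in NumPy; the query would be learned in practice, and the real model uses multi-head attention.

```python
import numpy as np

def attention_pooling(patches, query):
    """Aggregate N patch features into one global vector via attention
    against a single (here randomly chosen, normally learned) query."""
    scores = patches @ query / np.sqrt(patches.shape[1])    # (N,)
    e = np.exp(scores - scores.max())
    w = e / e.sum()                                         # attention map over patches
    return w @ patches                                      # (D,) global descriptor

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 32))   # e.g. a 7x7 grid of conv features
query = rng.standard_normal(32)           # hypothetical learned query
global_vec = attention_pooling(patches, query)
print(global_vec.shape)  # (32,)
```

A side benefit noted in the abstract is interpretability: the attention map `w` directly shows which spatial locations drive the global prediction.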
arXiv Detail & Related papers (2021-12-27T14:05:41Z) - Cost Aggregation Is All You Need for Few-Shot Segmentation [28.23753949369226]
We introduce Volumetric Aggregation with Transformers (VAT) to tackle the few-shot segmentation task.
VAT uses both convolutions and transformers to efficiently handle high dimensional correlation maps between query and support.
We find that the proposed method attains state-of-the-art performance even on standard benchmarks for the semantic correspondence task.
arXiv Detail & Related papers (2021-12-22T06:18:51Z) - Semantic Correspondence with Transformers [68.37049687360705]
We propose Cost Aggregation with Transformers (CATs) to find dense correspondences between semantically similar images.
We include appearance affinity modelling to disambiguate the initial correlation maps and multi-level aggregation.
We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods and provide extensive ablation studies.
arXiv Detail & Related papers (2021-06-04T14:39:03Z) - GOCor: Bringing Globally Optimized Correspondence Volumes into Your
Neural Network [176.3781969089004]
The feature correlation layer serves as a key neural network module in computer vision problems that involve dense correspondences between image pairs.
We propose GOCor, a fully differentiable dense matching module, acting as a direct replacement to the feature correlation layer.
Our approach significantly outperforms the feature correlation layer for the tasks of geometric matching, optical flow, and dense semantic matching.
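For context, the standard feature correlation layer that GOCor replaces computes dot products between every pair of positions in two feature maps, yielding a 4D cost volume. A minimal sketch (with cosine normalization, a common but assumed choice here):

```python
import numpy as np

def correlation_volume(f1, f2):
    """Plain feature correlation layer: similarity between every position
    of f1 and every position of f2. GOCor replaces this fixed operation
    with a globally optimized, learnable matching module.

    f1, f2: (H, W, D) feature maps.  Returns a (H, W, H, W) cost volume.
    """
    H, W, D = f1.shape
    a = f1.reshape(-1, D)
    b = f2.reshape(-1, D)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # cosine normalization
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T).reshape(H, W, H, W)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((5, 5, 16))
f2 = rng.standard_normal((5, 5, 16))
C = correlation_volume(f1, f2)
print(C.shape)  # (5, 5, 5, 5)
```

The quadratic size of this volume in the number of positions is exactly why the aggregation and optimization strategies in the papers above matter.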
arXiv Detail & Related papers (2020-09-16T17:33:01Z) - Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves top accuracy on the challenging COCO keypoint benchmark and state-of-the-art results on the MPII dataset.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.