Attention-Based Multimodal Image Matching
- URL: http://arxiv.org/abs/2103.11247v2
- Date: Sun, 24 Sep 2023 11:58:32 GMT
- Title: Attention-Based Multimodal Image Matching
- Authors: Aviad Moreshet, Yosi Keller
- Abstract summary: We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
- Score: 16.335191345543063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an attention-based approach for multimodal image patch matching
using a Transformer encoder attending to the feature maps of a multiscale
Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image
embeddings while emphasizing task-specific appearance-invariant image cues. We
also introduce an attention-residual architecture, using a residual connection
bypassing the encoder. This additional learning signal facilitates end-to-end
training from scratch. Our approach is experimentally shown to achieve new
state-of-the-art accuracy on both multimodal and single modality benchmarks,
illustrating its general applicability. To the best of our knowledge, this is
the first successful application of the Transformer encoder architecture to
the multimodal image patch matching task.
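The data flow the abstract describes can be sketched as follows. Everything here is an illustrative assumption rather than the authors' implementation: a single attention head stands in for the full Transformer encoder, the shapes are toy values, and a cosine similarity stands in for the matching head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head self-attention over the token sequence (the paper uses
    # a full Transformer encoder; one head keeps the sketch short).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

def match_score(maps_a, maps_b, d=64):
    # maps_a / maps_b: hypothetical per-scale embeddings of the two
    # patches, e.g. one d-dim vector per scale from a shared (Siamese)
    # multiscale CNN.
    tokens = np.stack(maps_a + maps_b)            # (2 * num_scales, d)
    encoded = self_attention(tokens, d)
    # Attention-residual: a connection bypassing the encoder gives the
    # CNN a direct learning signal, easing end-to-end training.
    fused = encoded + tokens
    a = fused[: len(maps_a)].mean(axis=0)
    b = fused[len(maps_a):].mean(axis=0)
    # Cosine similarity as a stand-in matching head.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

A trained model would learn the projection matrices and the CNN jointly; here the weights are random purely to make the multiscale aggregation and the residual bypass concrete.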
Related papers
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- DIAMANT: Dual Image-Attention Map Encoders For Medical Image Segmentation [46.19060502876747]
We show that, by taking advantage of the attention-map visualizations obtained from a self-supervised pretrained vision transformer (e.g., DINO), one can outperform complex transformer-based networks at much lower computational cost.
The results of our experiments on two publicly available medical imaging datasets show that the proposed pipeline outperforms U-Net and the state-of-the-art medical image segmentation models.
arXiv Detail & Related papers (2023-04-28T00:11:18Z)
- Unbiased Multi-Modality Guidance for Image Inpainting [27.286351511243502]
We develop an end-to-end multi-modality guided transformer network for image inpainting.
Within each transformer block, the proposed spatial-aware attention module can learn the multi-modal structural features efficiently.
Our method enriches semantically consistent context in an image based on discriminative information from multiple modalities.
arXiv Detail & Related papers (2022-08-25T03:13:43Z)
- Contrastive Attention Network with Dense Field Estimation for Face Completion [11.631559190975034]
We propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders.
To deal with geometric variations of face images, a dense correspondence field is integrated into the network.
This multi-scale architecture helps the decoder carry the discriminative representations learned by the encoders into the completed images.
arXiv Detail & Related papers (2021-12-20T02:54:38Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning [5.926203312586108]
We propose TransMEF, a transformer-based multi-exposure image fusion framework.
The framework is based on an encoder-decoder network, which can be trained on large natural image datasets.
arXiv Detail & Related papers (2021-12-02T07:43:42Z)
- Rethinking Coarse-to-Fine Approach in Single Image Deblurring [19.195704769925925]
We present a fast and accurate deblurring network design using a multi-input multi-output U-net.
The proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity.
arXiv Detail & Related papers (2021-08-11T06:37:01Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the fact that convolutions, fully connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, independently mapping text and vision into a joint embedding space (dual encoders) is attractive because retrieval remains efficient at scale.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
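The dual-encoder scalability argument above can be made concrete with a toy retrieval loop. The encoder below is a hypothetical random stand-in, not the paper's model: the point is that gallery embeddings are computed once, so each query costs a single matrix-vector product rather than a cross-attention pass over every gallery item.

```python
import numpy as np

def embed(item, d=32):
    # Hypothetical stand-in for a learned text or image encoder:
    # a deterministic random unit vector per item.
    rng = np.random.default_rng(abs(hash(item)) % (2 ** 32))
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def build_index(items):
    # Dual-encoder advantage: gallery embeddings are computed ONCE,
    # offline, independently of any future query.
    return np.stack([embed(x) for x in items])

def search(query, items, index, k=2):
    scores = index @ embed(query)   # one matrix-vector product per query
    top = np.argsort(-scores)[:k]
    return [items[i] for i in top]
```

A cross-attention reranker, by contrast, must run a transformer jointly over the query and each candidate, which is why it is typically applied only to the dual encoder's top-k shortlist.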
This list is automatically generated from the titles and abstracts of the papers in this site.