Attention-Based Multimodal Image Matching
- URL: http://arxiv.org/abs/2103.11247v2
- Date: Sun, 24 Sep 2023 11:58:32 GMT
- Title: Attention-Based Multimodal Image Matching
- Authors: Aviad Moreshet, Yosi Keller
- Abstract summary: We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
- Score: 16.335191345543063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an attention-based approach for multimodal image patch matching
using a Transformer encoder attending to the feature maps of a multiscale
Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image
embeddings while emphasizing task-specific appearance-invariant image cues. We
also introduce an attention-residual architecture, using a residual connection
bypassing the encoder. This additional learning signal facilitates end-to-end
training from scratch. Our approach is experimentally shown to achieve new
state-of-the-art accuracy on both multimodal and single-modality benchmarks,
illustrating its general applicability. To the best of our knowledge, this is
the first successful application of the Transformer encoder architecture to
the multimodal image patch matching task.
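A minimal PyTorch sketch of the design the abstract describes: a shared (Siamese) CNN produces patch tokens for a Transformer encoder whose output is added back to its input, so the residual bypass supplies an extra learning signal for training from scratch. The layer sizes, the single-scale backbone (the paper aggregates multiscale feature maps), and the multiplicative scoring head are illustrative assumptions, not the authors' configuration.
```python
import torch
import torch.nn as nn

class AttentionResidualMatcher(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        # Shared CNN backbone applied to both patches (Siamese weights).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # match / no-match score (assumed)

    def embed(self, patch):
        feats = self.cnn(patch)                    # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        # Attention-residual: the encoder output is added to its input,
        # so gradients also flow around the encoder during training.
        tokens = tokens + self.encoder(tokens)
        return tokens.mean(dim=1)                  # pooled patch embedding

    def forward(self, patch_a, patch_b):
        return self.head(self.embed(patch_a) * self.embed(patch_b))

# Usage: score a pair of 64x64 grayscale patches from different modalities.
model = AttentionResidualMatcher()
a, b = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
print(model(a, b).shape)  # torch.Size([8, 1])
```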
Related papers
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
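The summary names knowledge distillation without detailing TriKD's triple-view mechanics, so the sketch below shows only the generic soft-target distillation loss such frameworks typically build on; the temperature and the per-pixel reshaping for segmentation are assumptions.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL divergence between temperature-softened class distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across T.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Usage: per-pixel distillation for segmentation logits (B, C, H, W).
s = torch.randn(2, 19, 64, 64).flatten(2).transpose(1, 2).reshape(-1, 19)
t = torch.randn(2, 19, 64, 64).flatten(2).transpose(1, 2).reshape(-1, 19)
print(distillation_loss(s, t))
```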
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- Unbiased Multi-Modality Guidance for Image Inpainting [27.286351511243502]
We develop an end-to-end multi-modality guided transformer network for image inpainting.
Within each transformer block, the proposed spatial-aware attention module learns multi-modal structural features efficiently.
Our method enriches semantically consistent context in an image based on discriminative information from multiple modalities.
arXiv Detail & Related papers (2022-08-25T03:13:43Z)
- Contrastive Attention Network with Dense Field Estimation for Face Completion [11.631559190975034]
We propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders.
To deal with geometric variations of face images, a dense correspondence field is integrated into the network.
This multi-scale architecture helps the decoder exploit the discriminative representations learned by the encoders when synthesizing images.
arXiv Detail & Related papers (2021-12-20T02:54:38Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignment can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
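A rough sketch of what early fusion inside the encoder can look like: visual tokens cross-attend to language token embeddings at an encoder stage, rather than fusing in a late decoder. The module layout and dimensions are assumptions, not LAVT's exact design.
```python
import torch
import torch.nn as nn

dim = 128
vis_stage = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

vis = torch.randn(2, 196, dim)   # 14x14 grid of visual tokens
lang = torch.randn(2, 20, dim)   # token embeddings of the referring phrase

vis = vis_stage(vis)
fused, _ = fuse(query=vis, key=lang, value=lang)  # inject language early
vis = vis + fused                                 # residual fusion
print(vis.shape)  # torch.Size([2, 196, 128])
```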
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning [5.926203312586108]
We propose TransMEF, a transformer-based multi-exposure image fusion framework.
The framework is based on an encoder-decoder network, which can be trained on large natural image datasets.
arXiv Detail & Related papers (2021-12-02T07:43:42Z)
- Rethinking Coarse-to-Fine Approach in Single Image Deblurring [19.195704769925925]
We present a fast and accurate deblurring network design using a multi-input multi-output U-net.
The proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity.
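To illustrate the multi-input multi-output idea in general terms (not the paper's exact MIMO-UNet), the sketch below feeds a downsampled image pyramid to a stand-in network and supervises a deblurred output at every scale; the single shared conv and the L1 loss are placeholders.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in "network": one conv applied at every scale (shared weights).
net = nn.Conv2d(3, 3, 3, padding=1)

img = torch.randn(1, 3, 256, 256)      # blurry input
sharp = torch.randn(1, 3, 256, 256)    # ground-truth sharp image

loss = 0.0
for s in (1.0, 0.5, 0.25):             # multi-input / multi-output pyramid
    x = img if s == 1.0 else F.interpolate(img, scale_factor=s)
    y = sharp if s == 1.0 else F.interpolate(sharp, scale_factor=s)
    loss = loss + F.l1_loss(net(x), y)  # supervise every scale
print(loss)
```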
arXiv Detail & Related papers (2021-08-11T06:37:01Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
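The core recipe is compact enough to sketch: encode an M-dimensional position with the cosine and sine of a learned linear projection (learnable Fourier features), then refine with a small MLP. The dimensions and the MLP shape here are assumptions.
```python
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    def __init__(self, pos_dim=2, fourier_dim=64, out_dim=128):
        super().__init__()
        # The projection is learned jointly with the model, unlike the
        # fixed frequencies of standard sinusoidal positional encoding.
        self.W = nn.Linear(pos_dim, fourier_dim // 2, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pos):                 # pos: (..., pos_dim)
        proj = self.W(pos)                  # (..., fourier_dim // 2)
        feats = torch.cat([proj.cos(), proj.sin()], dim=-1)
        return self.mlp(feats)              # (..., out_dim)

# Usage: encode the (x, y) positions of a 16x16 patch grid.
pe = LearnableFourierPE()
pos = torch.rand(256, 2)
print(pe(pos).shape)  # torch.Size([256, 128])
```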
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less-attention vIsion Transformer (LIT) builds on the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
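A hedged illustration of the stated idea: early stages process patch tokens with attention-free MLP blocks, and only later stages use self-attention. The dimensions and stage layout are assumptions, not LIT's exact architecture.
```python
import torch
import torch.nn as nn

dim = 96

def mlp_block(d):
    # Attention-free token processing: LayerNorm + two-layer MLP.
    return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                         nn.GELU(), nn.Linear(4 * d, d))

early = mlp_block(dim)                       # early stage: no attention
late = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

tokens = torch.randn(2, 196, dim)            # 14x14 patch tokens
tokens = tokens + early(tokens)              # residual MLP stage
tokens = late(tokens)                        # self-attention stage
print(tokens.shape)  # torch.Size([2, 196, 96])
```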
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently to large galleries.
The alternative of vision-text transformers with cross-attention gives considerable accuracy improvements over the joint embeddings.
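A small sketch of why dual encoders scale: because text and images are embedded independently, the whole gallery can be embedded offline and ranked with one matrix product at query time. The linear "encoders" below are placeholders for the paper's models, and all dimensions are assumptions.
```python
import torch
import torch.nn as nn

text_encoder = nn.Linear(300, 256)    # stand-in for a text Transformer
image_encoder = nn.Linear(2048, 256)  # stand-in for a vision backbone

def embed(x, encoder):
    z = encoder(x)
    return z / z.norm(dim=-1, keepdim=True)  # unit-normalize embeddings

# Precompute the image gallery once, offline.
gallery = embed(torch.randn(10_000, 2048), image_encoder)

# At query time, one matrix product ranks the whole gallery.
query = embed(torch.randn(1, 300), text_encoder)
scores = query @ gallery.T            # (1, 10000) cosine similarities
print(scores.topk(5).indices)         # indices of the top-5 matches
```
Cross-attention models, by contrast, must jointly process every (query, candidate) pair at query time, which is what makes them more accurate but harder to scale.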
arXiv Detail & Related papers (2021-03-30T17:57:08Z)