Attention-Based Multimodal Image Matching
- URL: http://arxiv.org/abs/2103.11247v2
- Date: Sun, 24 Sep 2023 11:58:32 GMT
- Title: Attention-Based Multimodal Image Matching
- Authors: Aviad Moreshet, Yosi Keller
- Abstract summary: We propose an attention-based approach for multimodal image patch matching using a Transformer encoder.
Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues.
This is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
- Score: 16.335191345543063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an attention-based approach for multimodal image patch matching
using a Transformer encoder attending to the feature maps of a multiscale
Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image
embeddings while emphasizing task-specific appearance-invariant image cues. We
also introduce an attention-residual architecture, using a residual connection
bypassing the encoder. This additional learning signal facilitates end-to-end
training from scratch. Our approach is experimentally shown to achieve new
state-of-the-art accuracy on both multimodal and single modality benchmarks,
illustrating its general applicability. To the best of our knowledge, this is
the first successful application of the Transformer encoder architecture to
the multimodal image patch matching task.
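The data flow the abstract describes can be sketched as follows. Everything here is an illustrative assumption rather than the authors' implementation: a single attention head stands in for the full Transformer encoder, the shapes are toy values, and a cosine similarity stands in for the matching head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head self-attention over the token sequence (the paper uses
    # a full Transformer encoder; one head keeps the sketch short).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

def match_score(maps_a, maps_b, d=64):
    # maps_a / maps_b: hypothetical per-scale embeddings of the two
    # patches, e.g. one d-dim vector per scale from a shared (Siamese)
    # multiscale CNN.
    tokens = np.stack(maps_a + maps_b)            # (2 * num_scales, d)
    encoded = self_attention(tokens, d)
    # Attention-residual: a connection bypassing the encoder gives the
    # CNN a direct learning signal, easing end-to-end training.
    fused = encoded + tokens
    a = fused[: len(maps_a)].mean(axis=0)
    b = fused[len(maps_a):].mean(axis=0)
    # Cosine similarity as a stand-in matching head.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

A trained model would learn the projection matrices and the CNN jointly; here the weights are random purely to make the multiscale aggregation and the residual bypass concrete.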
Related papers
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- DIAMANT: Dual Image-Attention Map Encoders For Medical Image Segmentation [46.19060502876747]
We show that, by taking advantage of the attention-map visualizations obtained from a self-supervised pretrained vision transformer (e.g., DINO), one can outperform complex transformer-based networks at much lower computational cost.
The results of our experiments on two publicly available medical imaging datasets show that the proposed pipeline outperforms U-Net and the state-of-the-art medical image segmentation models.
arXiv Detail & Related papers (2023-04-28T00:11:18Z)
- Unbiased Multi-Modality Guidance for Image Inpainting [27.286351511243502]
We develop an end-to-end multi-modality guided transformer network for image inpainting.
Within each transformer block, the proposed spatial-aware attention module can learn the multi-modal structural features efficiently.
Our method enriches semantically consistent context in an image based on discriminative information from multiple modalities.
arXiv Detail & Related papers (2022-08-25T03:13:43Z)
- Contrastive Attention Network with Dense Field Estimation for Face Completion [11.631559190975034]
We propose a self-supervised Siamese inference network to improve the generalization and robustness of encoders.
To deal with geometric variations of face images, a dense correspondence field is integrated into the network.
This multi-scale architecture helps the decoder carry the discriminative representations learned by the encoders into the completed images.
arXiv Detail & Related papers (2021-12-20T02:54:38Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning [5.926203312586108]
We propose TransMEF, a transformer-based multi-exposure image fusion framework.
The framework is based on an encoder-decoder network, which can be trained on large natural image datasets.
arXiv Detail & Related papers (2021-12-02T07:43:42Z)
- Rethinking Coarse-to-Fine Approach in Single Image Deblurring [19.195704769925925]
We present a fast and accurate deblurring network design using a multi-input multi-output U-net.
The proposed network outperforms the state-of-the-art methods in terms of both accuracy and computational complexity.
arXiv Detail & Related papers (2021-08-11T06:37:01Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the fact that convolutions, fully connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, independently mapping text and vision into a joint embedding space (dual encoders) is attractive because retrieval remains efficient at scale.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
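The dual-encoder scalability argument above can be made concrete with a toy retrieval loop. The encoder below is a hypothetical random stand-in, not the paper's model: the point is that gallery embeddings are computed once, so each query costs a single matrix-vector product rather than a cross-attention pass over every gallery item.

```python
import numpy as np

def embed(item, d=32):
    # Hypothetical stand-in for a learned text or image encoder:
    # a deterministic random unit vector per item.
    rng = np.random.default_rng(abs(hash(item)) % (2 ** 32))
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def build_index(items):
    # Dual-encoder advantage: gallery embeddings are computed ONCE,
    # offline, independently of any future query.
    return np.stack([embed(x) for x in items])

def search(query, items, index, k=2):
    scores = index @ embed(query)   # one matrix-vector product per query
    top = np.argsort(-scores)[:k]
    return [items[i] for i in top]
```

A cross-attention reranker, by contrast, must run a transformer jointly over the query and each candidate, which is why it is typically applied only to the dual encoder's top-k shortlist.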
This list is automatically generated from the titles and abstracts of the papers in this site.