ALADIN: Distilling Fine-grained Alignment Scores for Efficient
Image-Text Matching and Retrieval
- URL: http://arxiv.org/abs/2207.14757v1
- Date: Fri, 29 Jul 2022 16:01:48 GMT
- Title: ALADIN: Distilling Fine-grained Alignment Scores for Efficient
Image-Text Matching and Retrieval
- Authors: Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi,
Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara
- Abstract summary: Cross-modal retrieval consists in finding images related to a given query text or vice-versa.
Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
- Score: 51.588385824875886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching is gaining a leading role among tasks involving the joint
understanding of vision and language. In literature, this task is often used as
a pre-training objective to forge architectures able to jointly deal with
images and texts. Nonetheless, it has a direct downstream application:
cross-modal retrieval, which consists in finding images related to a given
query text or vice-versa. Solving this task is of critical importance in
cross-modal search engines. Many recent methods proposed effective solutions to
the image-text matching problem, mostly using recent large vision-language (VL)
Transformer networks. However, these models are often computationally
expensive, especially at inference time. This prevents their adoption in
large-scale cross-modal retrieval scenarios, where results should be provided
to the user almost instantaneously. In this paper, we propose to fill in the
gap between effectiveness and efficiency by proposing an ALign And DIstill
Network (ALADIN). ALADIN first produces high-effective scores by aligning at
fine-grained level images and texts. Then, it learns a shared embedding space -
where an efficient kNN search can be performed - by distilling the relevance
scores obtained from the fine-grained alignments. We obtained remarkable
results on MS-COCO, showing that our method can compete with state-of-the-art
VL Transformers while being almost 90 times faster. The code for reproducing
our results is available at https://github.com/mesnico/ALADIN.
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal
Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z) - Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening [53.1711708318581]
Current image-text retrieval methods suffer from $N$-related time complexity.
This paper presents a simple and effective keyword-guided pre-screening framework for the image-text retrieval.
arXiv Detail & Related papers (2023-03-14T09:36:42Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-basedER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-ofthe-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - Towards Efficient Cross-Modal Visual Textual Retrieval using
Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z) - Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using
Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN)
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks.
arXiv Detail & Related papers (2020-08-12T11:02:40Z) - Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive, the Transformer.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z) - Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.