Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using
Transformer Encoders
- URL: http://arxiv.org/abs/2008.05231v2
- Date: Tue, 2 Mar 2021 16:12:52 GMT
- Title: Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using
Transformer Encoders
- Authors: Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio
Gennaro, Stéphane Marchand-Maillet
- Abstract summary: We present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks.
- Score: 14.634046503477979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the evolution of deep-learning-based visual-textual processing
systems, precise multi-modal matching remains a challenging task. In this work,
we tackle the task of cross-modal retrieval through image-sentence matching
based on word-region alignments, using supervision only at the global
image-sentence level. Specifically, we present a novel approach called
Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a
fine-grained match between the underlying components of images and sentences,
i.e., image regions and words, respectively, in order to preserve the
informative richness of both modalities. TERAN obtains state-of-the-art results
on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover,
on MS-COCO, it also outperforms current approaches on the sentence retrieval
task.
Focusing on scalable cross-modal information retrieval, TERAN is designed to
keep the visual and textual data pipelines well separated. Cross-attention
links preclude separately extracting the visual and textual features needed
for the online search and offline indexing steps in large-scale
retrieval systems. In this respect, TERAN merges the information from the two
domains only during the final alignment phase, immediately before the loss
computation. We argue that the fine-grained alignments produced by TERAN pave
the way for research on effective and efficient methods for large-scale
cross-modal information retrieval. We compare the effectiveness of
our approach against relevant state-of-the-art methods. On the MS-COCO 1K test
set, we obtain an improvement of 5.7% and 3.5% respectively on the image and
the sentence retrieval tasks on the Recall@1 metric. The code used for the
experiments is publicly available on GitHub at
https://github.com/mesnico/TERAN.
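
To make the alignment idea in the abstract concrete, the sketch below shows one way a TERAN-style fine-grained score and a hinge-based triplet ranking loss could be computed from independently extracted region and word features. It is a minimal PyTorch illustration, not the authors' implementation: the function names, the max-over-regions / mean-over-words pooling, and the 0.2 margin are assumptions; the actual code is in the linked GitHub repository.

import torch
import torch.nn.functional as F

def fine_grained_similarity(regions, words, word_mask):
    """Illustrative TERAN-style global score from region-word alignments.

    regions:   (B_i, R, D) region features from the visual pipeline (precomputable offline)
    words:     (B_t, W, D) word features from the textual pipeline
    word_mask: (B_t, W)    1 for real tokens, 0 for padding
    Returns a (B_i, B_t) matrix of image-sentence scores.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Region-word cosine similarities: (B_i, B_t, R, W)
    sims = torch.einsum('ird,twd->itrw', regions, words)

    # Pool: for each word take its best-matching region, then average over words.
    # (Max-over-regions / mean-over-words is one plausible pooling; other choices exist.)
    best_region = sims.max(dim=2).values                 # (B_i, B_t, W)
    mask = word_mask.float().unsqueeze(0)                # (1, B_t, W)
    scores = (best_region * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return scores                                        # (B_i, B_t)

def hinge_triplet_loss(scores, margin=0.2):
    """Hinge-based triplet ranking loss with in-batch hardest negatives (a common choice).
    Assumes a square score matrix where image i matches sentence i."""
    pos = scores.diag().view(-1, 1)                       # scores of matching pairs
    cost_s = (margin + scores - pos).clamp(min=0)         # image -> negative sentences
    cost_i = (margin + scores - pos.t()).clamp(min=0)     # sentence -> negative images
    eye = torch.eye(scores.size(0), device=scores.device).bool()
    cost_s = cost_s.masked_fill(eye, 0)
    cost_i = cost_i.masked_fill(eye, 0)
    return cost_s.max(1).values.mean() + cost_i.max(0).values.mean()

Because the score is a plain function of features produced by two separate pipelines, region features can be extracted and indexed offline, and only the lightweight pooling step runs at query time, which is the scalability argument made in the abstract.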
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively.
Then, a multi-granularity shared space is established with the proposed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists of finding images related to a given query text, or vice versa.
Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods, and comparable performance to the latest single-stream methods while being 10,800x faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence feature extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z)
- RANSAC-Flow: generic two-stage image alignment [53.11926395028508]
We show that a simple unsupervised two-stage approach to image alignment performs surprisingly well, achieving competitive results across a range of tasks and datasets despite its simplicity.
arXiv Detail & Related papers (2020-04-03T12:37:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.