Transformer Reasoning Network for Image-Text Matching and Retrieval
- URL: http://arxiv.org/abs/2004.09144v3
- Date: Mon, 25 Jan 2021 21:19:28 GMT
- Title: Transformer Reasoning Network for Image-Text Matching and Retrieval
- Authors: Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato
- Abstract summary: We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
- Score: 14.238818604272751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching is a fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by letting image and text features from the two different processing pipelines interact, usually through mutual attention mechanisms. However, this rules out extracting the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture reasons separately on the two different modalities and enforces a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the network produces compact yet very rich visual and textual features, available for the subsequent indexing step. Experiments are conducted on the MS-COCO dataset; we evaluate the results using a discounted cumulative gain metric with relevance computed from caption similarities, so that non-exact but relevant search results are also credited. On this metric, we achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
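The weight-sharing design described in the abstract lends itself to a short sketch. The PyTorch code below is an illustrative approximation rather than the authors' released implementation: the class name `SharedDualEncoder`, the layer counts `n_private` and `n_shared`, and the feature size are assumptions; in TERN the visual stream consumes precomputed region features and the textual stream consumes word embeddings, each prefixed with a CLS-like global token whose final state serves as the compact feature.

```python
import torch.nn as nn

class SharedDualEncoder(nn.Module):
    """Illustrative TERN-style dual-stream encoder (not the official code).

    Each modality is reasoned on by its own private Transformer Encoder (TE)
    stack; the deeper layers are shared between the two streams, pushing both
    modalities toward a common abstract concept space.
    """

    def __init__(self, d_model=1024, n_heads=8, n_private=4, n_shared=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.visual_private = nn.TransformerEncoder(make_layer(), n_private)
        self.text_private = nn.TransformerEncoder(make_layer(), n_private)
        # A single stack applied to both streams: the weight sharing itself.
        self.shared = nn.TransformerEncoder(make_layer(), n_shared)

    def forward(self, regions, tokens):
        # regions: (B, R, d_model) image region features, global token at 0
        # tokens:  (B, T, d_model) word embeddings, global token at 0
        v = self.shared(self.visual_private(regions))
        t = self.shared(self.text_private(tokens))
        # Final global-token states: one compact vector per image/caption.
        return v[:, 0], t[:, 0]
```

Because the two streams interact only through shared weights, never through cross-attention, every image feature can be computed once and stored, and a caption can then be matched against a large collection by plain nearest-neighbor search. A hypothetical indexing step with FAISS follows (the paper does not prescribe a specific index; cosine similarity via normalized inner products is assumed):

```python
import numpy as np
import faiss  # assumed here; any (approximate) nearest-neighbor library works

d = 1024                                            # feature size (assumption)
image_feats = np.random.randn(10000, d).astype("float32")  # stand-in features
faiss.normalize_L2(image_feats)        # cosine similarity as inner product
index = faiss.IndexFlatIP(d)           # exact inner-product index
index.add(image_feats)                 # offline: index all images once

query = np.random.randn(1, d).astype("float32")     # text feature at query time
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 image ids for the caption
```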
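The evaluation protocol can be made concrete as well. Below is a minimal normalized-DCG computation in which each retrieved image carries a graded relevance derived from caption similarity (the abstract only states that relevance is computed from caption similarities; the grading function itself is left abstract here, and the ranked `relevances` list is assumed precomputed):

```python
import math

def ndcg(relevances, p=25):
    """Discounted cumulative gain over the top-p results, normalized by the
    ideal (descending-relevance) ordering.

    relevances: graded relevance of each retrieved image, in ranked order.
    Grading by caption similarity lets non-exact but pertinent results earn
    partial credit instead of counting as plain misses.
    """
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:p]))
    ideal = sorted(relevances, reverse=True)[:p]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A ranking that places a near-duplicate scene (relevance 0.8) first scores
# higher than one that buries it:
#   ndcg([0.8, 0.2, 0.0])  ->  1.00
#   ndcg([0.2, 0.0, 0.8])  ->  0.65 (approximately)
```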
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists of finding images related to a given query text, or vice versa.
Many recent methods have proposed effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks.
This paper proposes the ALign And DIstill Network (ALADIN) to bridge the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
Here, a two-stream architecture is developed to decompose image-text matching into image-to-text and text-to-image similarity computation.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
arXiv Detail & Related papers (2021-11-19T03:23:10Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- Cross-modal Image Retrieval with Deep Mutual Information Maximization [14.778158582349137]
We study cross-modal image retrieval, where the input contains a source image plus some text describing certain modifications that turn it into the desired image.
Our method narrows the modality gap between the text and image modalities by maximizing the mutual information between their representations, which are not exactly semantically identical.
arXiv Detail & Related papers (2021-03-10T13:08:09Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called the Transformer Encoder Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively.
arXiv Detail & Related papers (2020-08-12T11:02:40Z)