Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for
Improved Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2103.11920v1
- Date: Mon, 22 Mar 2021 15:08:06 GMT
- Title: Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for
Improved Cross-Modal Retrieval
- Authors: Gregor Geigle, Jonas Pfeiffer, Nils Reimers, Ivan Vuli\'c, Iryna
Gurevych
- Abstract summary: Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and substantial efficiency gains over state-of-the-art cross-encoders.
- Score: 80.35589927511667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art approaches to cross-modal retrieval process text and
visual input jointly, relying on Transformer-based architectures with
cross-attention mechanisms that attend over all words and objects in an image.
While offering unmatched retrieval performance, such models 1) are typically
pretrained from scratch and thus less scalable, and 2) suffer from huge retrieval
latency and inefficiency issues, which make them impractical in realistic
applications. To address these crucial gaps towards both improved and efficient
cross-modal retrieval, we propose a novel fine-tuning framework which turns any
pretrained text-image multi-modal model into an efficient retrieval model. The
framework is based on a cooperative retrieve-and-rerank approach which
combines: 1) twin networks to separately encode all items of a corpus, enabling
efficient initial retrieval, and 2) a cross-encoder component for a more
nuanced (i.e., smarter) ranking of the retrieved small set of items. We also
propose to jointly fine-tune the two components with shared weights, yielding a
more parameter-efficient model. Our experiments on a series of standard
cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot
setups demonstrate improved accuracy and huge efficiency benefits over the
state-of-the-art cross-encoders.
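The recipe described in the abstract, twin networks for fast candidate retrieval plus a cross-encoder for reranking, with both components sharing one backbone, can be illustrated with a short training sketch. This is a minimal illustration using a toy Transformer and random features, not the authors' released code; the class and function names (SharedRetrieveRerank, joint_loss) and all dimensions are invented for the example.

```python
# Minimal sketch of cooperative retrieve-and-rerank with shared weights.
# Assumes pre-extracted text token embeddings and image region features
# of a common dimension; everything here is a toy stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedRetrieveRerank(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # One Transformer backbone is reused for both scoring modes
        # (the "shared weights" idea from the abstract).
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, layers)
        self.rerank_head = nn.Linear(dim, 1)  # cross-encoder matching score

    def embed(self, feats):
        # Bi-encoder mode: encode one modality alone, mean-pool to a vector.
        return F.normalize(self.backbone(feats).mean(dim=1), dim=-1)

    def rerank_score(self, text_feats, image_feats):
        # Cross-encoder mode: concatenate both sequences so self-attention
        # can attend across modalities, then predict a matching score.
        joint = torch.cat([text_feats, image_feats], dim=1)
        return self.rerank_head(self.backbone(joint).mean(dim=1)).squeeze(-1)


def joint_loss(model, text_feats, image_feats):
    # Contrastive (retrieval) loss on the bi-encoder embeddings ...
    t, v = model.embed(text_feats), model.embed(image_feats)
    logits = t @ v.t() / 0.05
    labels = torch.arange(t.size(0))
    retrieval = F.cross_entropy(logits, labels)
    # ... plus a matching loss on the cross-encoder, here with in-batch
    # negatives obtained by shifting the images by one position.
    pos = model.rerank_score(text_feats, image_feats)
    neg = model.rerank_score(text_feats, image_feats.roll(1, dims=0))
    match = F.binary_cross_entropy_with_logits(
        torch.cat([pos, neg]),
        torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]),
    )
    return retrieval + match


# Toy usage: batch of 8 captions (12 tokens) and images (36 regions), dim 256.
model = SharedRetrieveRerank()
loss = joint_loss(model, torch.randn(8, 12, 256), torch.randn(8, 36, 256))
loss.backward()
```

At inference time only the bi-encoder embeddings are indexed, so the expensive cross-encoder path touches just the small set of top-ranked candidates.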
Related papers
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers [53.224004166460254]
This paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers.
CrossGET adaptively combines tokens in real-time during inference, significantly reducing computational costs.
Experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-27T12:07:21Z) - Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z) - Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote
Sensing [1.6758573326215689]
Cross-modal text-image retrieval has attracted great attention in remote sensing.
We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS.
Experimental results show that the proposed DUCH outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-04-19T07:25:25Z) - Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal
Text-Image Retrieval in Remote Sensing [1.6758573326215689]
We introduce a novel deep unsupervised cross-modal contrastive hashing (DUCH) method for RS text-image retrieval.
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods.
Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv Detail & Related papers (2022-01-20T12:05:10Z) - Building an Efficient and Effective Retrieval-based Dialogue System via
Mutual Learning [27.04857039060308]
We propose to combine the best of both worlds to build a retrieval system.
We employ a fast bi-encoder to replace the traditional feature-based pre-retrieval model.
We train the pre-retrieval model and the re-ranking model at the same time via mutual learning.
arXiv Detail & Related papers (2021-10-01T01:32:33Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision into a joint embedding space, a.k.a. dual encoders, is attractive because retrieval then scales efficiently to large collections.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings; a two-stage pipeline combining both is sketched after this list.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.