Towards Fast and Accurate Image-Text Retrieval with Self-Supervised
Fine-Grained Alignment
- URL: http://arxiv.org/abs/2308.14009v1
- Date: Sun, 27 Aug 2023 05:45:54 GMT
- Title: Towards Fast and Accurate Image-Text Retrieval with Self-Supervised
Fine-Grained Alignment
- Authors: Jiamin Zhuang, Jing Yu, Yang Ding, Xiangyan Qu, Yue Hu
- Abstract summary: We propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework.
SelfAlign forces image-text alignment at both the concept level and the context level by self-supervised contrastive learning.
It consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2% and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K and MS-COCO 5K datasets, respectively.
- Score: 9.183014914635553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval requires the system to bridge the heterogeneous gap
between vision and language for accurate retrieval while keeping the network
lightweight enough for efficient retrieval. Existing trade-off solutions mainly
study the problem by incorporating cross-modal interactions into the
independent-embedding framework or by leveraging stronger pretrained encoders,
which still demand time-consuming similarity measurement or a heavyweight model
structure at the retrieval stage. In this work, we propose an image-text
alignment module, SelfAlign, on top of the independent-embedding framework, which
improves retrieval accuracy while maintaining retrieval efficiency
without extra supervision. SelfAlign contains two collaborative sub-modules
that force image-text alignment at both the concept level and the context level by
self-supervised contrastive learning. It requires no cross-modal embedding
interactions during training and keeps the image and text
encoders independent during retrieval. With comparable time cost, SelfAlign consistently
boosts the accuracy of state-of-the-art non-pretraining independent-embedding
models by 9.1%, 4.2% and 6.6% in terms of R@sum score on the
Flickr30K, MS-COCO 1K and MS-COCO 5K datasets, respectively. Its retrieval accuracy also
outperforms that of most existing interactive-embedding models, with an orders-of-magnitude
decrease in retrieval time. The source code is available at:
https://github.com/Zjamie813/SelfAlign.
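The repository above holds the authors' implementation; purely as orientation, the independent-embedding setup with contrastive alignment at two granularities can be sketched as below. The encoder interfaces, the mean-pooling of local features, and the loss weighting are assumptions for illustration, not SelfAlign's actual sub-modules.

```python
# Minimal sketch of a dual-encoder ("independent-embedding") retrieval setup
# with contrastive alignment at two granularities. The encoders and pooling
# are illustrative placeholders, not the SelfAlign implementation.
import torch
import torch.nn.functional as F


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def training_step(image_encoder, text_encoder, images, captions):
    # Each encoder is assumed to return a global ("context-level") vector and
    # a set of local ("concept-level") features, e.g. regions or tokens.
    img_global, img_local = image_encoder(images)           # (B, D), (B, R, D)
    txt_global, txt_local = text_encoder(captions)          # (B, D), (B, T, D)

    # Context-level alignment on the global embeddings.
    loss_context = info_nce(img_global, txt_global)

    # Concept-level alignment on pooled local features (a crude stand-in for
    # the paper's concept-level sub-module).
    loss_concept = info_nce(img_local.mean(dim=1), txt_local.mean(dim=1))

    return loss_context + loss_concept
```

At retrieval time only the global embeddings and a dot product are needed, which is what keeps the independent-embedding framework fast.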
Related papers
- DEMO: A Statistical Perspective for Efficient Image-Text Matching [32.256725860652914]
We introduce Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching.
DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution.
In addition, we introduce collaborative consistency learning, which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distributions from different directions.
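Read loosely, the distribution-based view amounts to encoding several augmented views of each image and enforcing consistency among their codes. The sketch below illustrates that reading with placeholder augmentations, hashing head and consistency term; it is not DEMO's actual design.

```python
# Rough sketch: multiple augmented views of one image are treated as samples
# of its semantic distribution, mapped to soft binary-like codes, and pulled
# towards a common center. All components here are illustrative placeholders.
import torch
import torch.nn.functional as F
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
])


def view_codes(encoder, hash_head, image, num_views=4):
    """image: (C, H, W) tensor; returns soft codes in (-1, 1) for each view."""
    views = torch.stack([augment(image) for _ in range(num_views)])  # (V, C, H, W)
    feats = encoder(views)                                           # (V, D), assumed
    return torch.tanh(hash_head(feats))                              # Hamming-like codes


def consistency_loss(codes):
    """Pull every view's code towards the mean code of the same image."""
    center = codes.mean(dim=0, keepdim=True)
    return F.mse_loss(codes, center.expand_as(codes))
```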
arXiv Detail & Related papers (2024-05-19T09:38:56Z) - Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
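A loose sketch of the two-branch margin idea follows: an anchor branch trained with a base margin and a target branch trained with a larger one. The branch interfaces, similarity convention, and margin values are placeholders rather than the DBL algorithm itself.

```python
# Illustrative two-branch margin setup: the anchor branch uses a fixed triplet
# margin, the target branch a stricter one to further separate matched and
# unmatched pairs. Models are assumed to return pairwise similarity scores.
import torch
import torch.nn.functional as F


def triplet_loss(sim_pos, sim_neg, margin):
    """Hinge loss on similarities of matched vs. unmatched pairs."""
    return F.relu(margin + sim_neg - sim_pos).mean()


def dual_branch_losses(anchor_model, target_model, images, texts, negatives,
                       base_margin=0.2, extra_margin=0.1):
    # Anchor branch: standard margin, providing a reference for the data.
    a_pos, a_neg = anchor_model(images, texts), anchor_model(images, negatives)
    loss_anchor = triplet_loss(a_pos, a_neg, base_margin)

    # Target branch: larger margin to further enlarge the relative distance
    # between matched and unmatched samples.
    t_pos, t_neg = target_model(images, texts), target_model(images, negatives)
    loss_target = triplet_loss(t_pos, t_neg, base_margin + extra_margin)

    return loss_anchor, loss_target
```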
arXiv Detail & Related papers (2024-04-28T08:44:28Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
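A minimal sketch of the neighbor-representation idea: describe each image by its top-k similarities to a pool of texts and use that profile downstream. The text pool, which the summary identifies as the hard part, is simply assumed here.

```python
# Sketch of a cross-modal neighbor representation: an image is described by
# its similarities to a pool of texts, and that (sparse) similarity profile,
# rather than the raw image feature, is used downstream. The text pool is a
# placeholder; CODER's actual text-generation pipeline is more involved.
import torch
import torch.nn.functional as F


@torch.no_grad()
def neighbor_representation(image_features, text_features, k=32):
    """image_features: (N, D) image embeddings; text_features: (M, D) text embeddings."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    sims = img @ txt.t()                          # (N, M) cosine similarities
    topk = sims.topk(k, dim=-1)                   # keep the k nearest texts per image
    coder = torch.zeros_like(sims).scatter_(-1, topk.indices, topk.values)
    return coder                                  # sparse similarity profile per image
```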
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - Robust Cross-Modal Representation Learning with Progressive
Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
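The soft-alignment idea can be sketched as a CLIP-style contrastive loss whose targets blend the one-hot pairing labels with the model's own softened similarity distribution; the progressive schedule and teacher/student partitioning of the actual method are collapsed into a single mixing weight here.

```python
# Sketch of contrastive learning with soft image-text alignment targets from
# self-distillation: the targets mix the usual one-hot pair labels with the
# model's own softened similarities. The "progressive" schedule is reduced to
# a single alpha for illustration.
import torch
import torch.nn.functional as F


def soft_contrastive_loss(img_emb, txt_emb, alpha=0.5, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B)

    hard = torch.eye(logits.size(0), device=logits.device)   # one-hot pair labels
    with torch.no_grad():
        soft = F.softmax(logits, dim=-1)                     # model's own alignment
    targets = (1 - alpha) * hard + alpha * soft              # soft image-text alignments

    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```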
arXiv Detail & Related papers (2022-04-10T03:28:18Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach, using vision-text transformers with cross-attention, gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z) - Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for
Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over state-of-the-art cross-encoders.
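The cooperative retrieve-then-rerank pattern described here is straightforward to sketch: a dual encoder shortlists candidates over the whole gallery, and a cross-encoder re-scores only that shortlist. The encoder interfaces below are assumptions, not the paper's fine-tuning framework.

```python
# Sketch of retrieve-and-rerank: a cheap dual encoder shortlists candidates
# over precomputed gallery embeddings; an expensive cross-encoder re-scores
# only the shortlist. Both encoders are assumed, hypothetical interfaces.
import torch
import torch.nn.functional as F


@torch.no_grad()
def retrieve_then_rerank(query_text, text_encoder, image_embs, cross_encoder,
                         images, top_k=20):
    # Stage 1: dual-encoder retrieval over precomputed image embeddings.
    q = F.normalize(text_encoder(query_text), dim=-1)         # (1, D)
    gallery = F.normalize(image_embs, dim=-1)                  # (N, D)
    shortlist = (q @ gallery.t()).squeeze(0).topk(top_k).indices

    # Stage 2: cross-encoder reranking of the shortlisted candidates only;
    # cross_encoder is assumed to return a scalar score per (image, text) pair.
    rerank_scores = torch.stack(
        [cross_encoder(images[i], query_text) for i in shortlist.tolist()])
    order = rerank_scores.argsort(descending=True)
    return shortlist[order]                                    # final ranking
```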
arXiv Detail & Related papers (2021-03-22T15:08:06Z)