Where Does the Performance Improvement Come From? - A Reproducibility
Concern about Image-Text Retrieval
- URL: http://arxiv.org/abs/2203.03853v1
- Date: Tue, 8 Mar 2022 05:01:43 GMT
- Title: Where Does the Performance Improvement Come From? - A Reproducibility
Concern about Image-Text Retrieval
- Authors: Jun Rao, Fei Wang, Liang Ding, Shuhan Qi, Yibing Zhan, Weifeng Liu,
Dacheng Tao
- Abstract summary: Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
- Score: 85.03655458677295
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper seeks to provide the information retrieval community with some
reflections on the current improvements of retrieval learning through the
analysis of the reproducibility aspects of image-text retrieval models. For the
latter part of the past decade, image-text retrieval has gradually become a
major research direction in the field of information retrieval because of the
growth of multi-modal data. Many researchers use benchmark datasets like
MS-COCO and Flickr30k to train and assess the performance of image-text
retrieval algorithms. Past research has focused mostly on performance, with a
series of state-of-the-art methods proposed along different technical routes.
According to their claims, these approaches achieve better modal interactions
and thus more precise multimodal representations. In contrast to those previous
works, we focus on the reproducibility of the approaches and on a systematic
examination of the factors that lead to the improved performance of pretrained
and nonpretrained models in retrieving images and text. To be more specific, we
first examine the related reproducibility concerns and why the focus is on
image-text retrieval tasks, and then we systematically summarize the current
paradigm of image-text retrieval models and the stated contributions of those
approaches. Second, we analyze various aspects of the reproduction of
pretrained and nonpretrained retrieval models. On this basis, we conduct
ablation experiments and identify several influencing factors that affect
retrieval recall more than the improvements claimed in the original papers.
Finally, we also present some reflections and issues that should be considered
by the retrieval community in the future. Our code is freely available at
https://github.com/WangFei-2019/Image-text-Retrieval.
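For context on the metric at stake: image-text retrieval on MS-COCO and Flickr30k is typically reported as Recall@K over an image-text similarity matrix. The sketch below is a minimal illustration of that evaluation, assuming one ground-truth image per caption; it is not taken from the authors' repository.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-image Recall@K from a similarity matrix.

    sim[i, j] is the similarity between text query i and image j;
    the ground-truth match for query i is assumed to be image i.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # indices sorted by descending similarity
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(n)])
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

# Toy check: boosting the diagonal makes the ground-truth pairs rank first.
rng = np.random.default_rng(0)
sim = rng.normal(size=(100, 100)) + 5.0 * np.eye(100)
print(recall_at_k(sim))  # e.g. {'R@1': 1.0, 'R@5': 1.0, 'R@10': 1.0}
```

Small implementation choices here (tie-breaking, how the five captions per image are pooled, similarity normalization) are exactly the kind of factors the paper reports as moving recall more than some claimed improvements.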
Related papers
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive manner and propose an autonomous decision module to choose the better match between generated and retrieved images.
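The decision module itself is not described in this summary; purely as a hedged illustration, one simple decision rule would score both candidates against the query text with an off-the-shelf image-text scorer such as CLIP and keep the better match. Everything below, names included, is an assumption, not the authors' implementation.

```python
import torch
import clip  # OpenAI CLIP (github.com/openai/CLIP); any image-text scorer works
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between one image and one caption."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(clip.tokenize([text]).to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def choose_candidate(generated: Image.Image, retrieved: Image.Image, prompt: str) -> Image.Image:
    """Hypothetical stand-in for the paper's autonomous decision module."""
    return generated if clip_score(generated, prompt) >= clip_score(retrieved, prompt) else retrieved
```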
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images [67.18010640829682]
We show that AI-generated images introduce an invisible relevance bias to text-image retrieval models.
The inclusion of AI-generated images in the training data of the retrieval models exacerbates the invisible relevance bias.
We propose an effective training method aimed at alleviating the invisible relevance bias.
arXiv Detail & Related papers (2023-11-23T16:22:58Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
- Semantic-Preserving Augmentation for Robust Image-Text Retrieval [27.2916415148638]
RVSE consists of novel image-based and text-based augmentation techniques called semantic-preserving augmentation for image (SPAugI) and text (SPAugT).
Since SPAugI and SPAugT change the original data while preserving its semantic information, the feature extractors are forced to generate semantics-aware embedding vectors.
From extensive experiments using benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
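The summary above does not spell out the transforms behind SPAugI and SPAugT; the sketch below is only a generic illustration of the semantic-preserving idea (label-preserving image perturbations plus lightweight synonym substitution for text), not the paper's actual augmentations.

```python
from torchvision import transforms

# Image side: perturbations that keep the depicted scene intact
# (a generic stand-in for SPAugI, not the paper's recipe).
spaug_image = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # keep most content
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

# Text side: meaning-preserving edits via a small synonym table
# (a hypothetical stand-in for SPAugT).
SYNONYMS = {"photo": "picture", "man": "person", "sofa": "couch"}

def spaug_text(caption: str) -> str:
    return " ".join(SYNONYMS.get(word, word) for word in caption.split())

print(spaug_text("a man sits on a sofa"))  # -> "a person sits on a couch"
```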
arXiv Detail & Related papers (2023-03-10T03:50:44Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering.
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval.
Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
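As a hedged sketch of the augmentation step described above (shapes and names are assumptions, not taken from the paper): retrieve the external captions nearest to the image in the shared embedding space and prepend them to the question fed to the transformer.

```python
import numpy as np

def retrieve_captions(img_emb: np.ndarray, caption_embs: np.ndarray,
                      captions: list, k: int = 3) -> list:
    """Return the k external captions closest to the image embedding.

    Embeddings are assumed L2-normalized and to live in the shared
    space produced by the alignment model.
    """
    sims = caption_embs @ img_emb          # cosine similarities
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]

def augment_vqa_input(question: str, img_emb: np.ndarray,
                      caption_embs: np.ndarray, captions: list, k: int = 3) -> str:
    """Prepend retrieved captions as context (illustrative formatting only)."""
    context = " ".join(retrieve_captions(img_emb, caption_embs, captions, k))
    return f"{context} [SEP] {question}"
```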
arXiv Detail & Related papers (2021-04-16T13:27:45Z)
- A Decade Survey of Content Based Image Retrieval using Deep Learning [13.778851745408133]
This paper presents a comprehensive survey of deep learning based developments in the past decade for content based image retrieval.
The similarity between the representative features of the query image and dataset images is used to rank the images for retrieval.
Over the past decade, deep learning has emerged as the dominant alternative to hand-designed feature engineering.
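The ranking step in that survey's framing is a one-liner in practice; a minimal sketch, assuming deep features have already been extracted:

```python
import numpy as np

def rank_by_similarity(query_feat: np.ndarray, db_feats: np.ndarray) -> np.ndarray:
    """Return database indices sorted by cosine similarity to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    return np.argsort(-(db @ q))

# Toy check: item 2 is identical to the query, so it ranks first.
rng = np.random.default_rng(1)
db = rng.normal(size=(5, 128))
print(rank_by_similarity(db[2].copy(), db)[0])  # 2
```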
arXiv Detail & Related papers (2020-11-23T02:12:30Z)