Rethinking Benchmarks for Cross-modal Image-text Retrieval
- URL: http://arxiv.org/abs/2304.10824v1
- Date: Fri, 21 Apr 2023 09:07:57 GMT
- Title: Rethinking Benchmarks for Cross-modal Image-text Retrieval
- Authors: Weijing Chen, Linli Yao, Qin Jin
- Abstract summary: Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
- Score: 44.31783230767321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval, as a fundamental and important branch of information
retrieval, has attracted extensive research attention. The main challenge of
this task is cross-modal semantic understanding and matching. Some recent works
focus more on fine-grained cross-modal semantic matching. With the prevalence
of large scale multimodal pretraining models, several state-of-the-art models
(e.g. X-VLM) have achieved near-perfect performance on widely-used image-text
retrieval benchmarks, i.e. MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper,
we review the two common benchmarks and observe that they are insufficient to
assess the true capability of models on fine-grained cross-modal semantic
matching. The reason is that a large number of images and texts in the
benchmarks are coarse-grained. Based on the observation, we renovate the
coarse-grained images and texts in the old benchmarks and establish the
improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the
image side, we enlarge the original image pool by adopting more similar images.
On the text side, we propose a novel semi-automatic renovation approach to
refine coarse-grained sentences into finer-grained ones with little human
effort. Furthermore, we evaluate representative image-text retrieval models on
our new benchmarks to demonstrate the effectiveness of our method. We also
analyze the capability of models on fine-grained semantic comprehension through
extensive experiments. The results show that even the state-of-the-art models
have much room for improvement in fine-grained semantic understanding,
especially in distinguishing attributes of close objects in images. Our code
and improved benchmark datasets are publicly available at:
https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire
further in-depth research on cross-modal retrieval.
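As context for the evaluations described above, here is a minimal sketch of how Recall@K is typically computed for image-text retrieval benchmarks such as MSCOCO-FG and Flickr30K-FG, assuming image and text embeddings have already been produced by some model. The function name and toy similarity matrix are illustrative and not taken from the paper's released code; enlarging the candidate image pool with more similar images, as the paper does, makes this ranking harder for models that miss fine-grained details.

```python
import numpy as np

def recall_at_k(sim, gt_index, ks=(1, 5, 10)):
    """Compute text-to-image Recall@K from a similarity matrix.

    sim:      (num_texts, num_images) similarity scores
    gt_index: (num_texts,) index of the ground-truth image for each text
    """
    # Rank candidate images for each text query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    # Position of the ground-truth image in each ranked list.
    gt_rank = np.array([np.where(ranking[i] == gt_index[i])[0][0]
                        for i in range(sim.shape[0])])
    return {k: float(np.mean(gt_rank < k)) for k in ks}

# Toy example: 3 text queries over a pool of 4 candidate images.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.8, 0.7, 0.1],
                [0.1, 0.6, 0.5, 0.9]])
gt = np.array([0, 2, 3])
print(recall_at_k(sim, gt, ks=(1, 5)))  # {1: 0.667, 5: 1.0}
```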
Related papers
- Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
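As a rough illustration of the text perturbation described above (replacing a single word so that the caption no longer matches the image), here is a hedged sketch; the substitution table and function name are made up for this example and are not RoCOCO's actual replacement vocabulary or procedure.

```python
import random

# Hypothetical substitution table; RoCOCO's actual replacement strategy and
# vocabulary are defined by that paper's authors, not reproduced here.
SUBSTITUTIONS = {"dog": "cat", "man": "woman", "red": "blue", "sitting": "standing"}

def perturb_caption(caption: str, seed: int = 0) -> str:
    """Replace one substitutable word so the caption's meaning changes."""
    rng = random.Random(seed)
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SUBSTITUTIONS]
    if not candidates:
        return caption  # nothing replaceable; caption left unchanged
    i = rng.choice(candidates)
    words[i] = SUBSTITUTIONS[words[i].lower()]
    return " ".join(words)

print(perturb_caption("a man in a red shirt sitting on a bench"))
# e.g. "a man in a red shirt standing on a bench"
```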
arXiv Detail & Related papers (2023-04-21T03:45:59Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - Where Does the Performance Improvement Come From? - A Reproducibility
Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related reproducibility concerns and explain why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
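The following is a toy sketch of the core idea of treating the images in a group as graph nodes that exchange information through message passing; the actual GNN in that paper is considerably more elaborate, and the aggregation rule, tensor shapes, and function name here are illustrative assumptions.

```python
import torch

def group_message_passing(node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """One round of mean-aggregation message passing over image nodes.

    node_feats: (num_images, feat_dim) per-image features (graph nodes)
    adj:        (num_images, num_images) binary adjacency within the group
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees
    messages = adj @ node_feats / deg                  # mean of neighbour features
    return torch.relu(node_feats + messages)           # residual update

# Toy group of 4 images with 8-d features, fully connected without self-loops.
feats = torch.randn(4, 8)
adj = torch.ones(4, 4) - torch.eye(4)
print(group_message_passing(feats, adj).shape)  # torch.Size([4, 8])
```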
arXiv Detail & Related papers (2020-12-09T12:40:13Z) - Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
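For reference, here is a minimal sketch of a hinge-based triplet ranking loss over in-batch negatives, the general form used by many image-text matching models; the margin value, cosine normalization, and sum-over-negatives choice are illustrative rather than the exact formulation of that paper.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin: float = 0.2) -> torch.Tensor:
    """Bidirectional hinge loss: matched pairs should outscore mismatched ones.

    img_emb, txt_emb: (batch, dim) embeddings where row i of each is a match.
    """
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t()                   # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)         # positive-pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    # Hinge over all in-batch negatives, ignoring the matched diagonal.
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

loss = triplet_ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```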
arXiv Detail & Related papers (2020-02-23T23:58:04Z) - An Effective Automatic Image Annotation Model Via Attention Model and
Data Equilibrium [0.0]
The proposed model has three phases, including a feature extractor, a tag generator, and an image annotator.
Experiments conducted on two benchmark datasets confirm the superiority of the proposed model over previous models.
arXiv Detail & Related papers (2020-01-26T05:59:57Z)