RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
- URL: http://arxiv.org/abs/2304.10727v4
- Date: Fri, 27 Sep 2024 01:40:17 GMT
- Title: RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
- Authors: Seulki Park, Daeho Um, Hajung Yoon, Sanghyuk Chun, Sangdoo Yun,
- Abstract summary: We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
- Score: 36.19590638188108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the extensive use of vision-language models in various downstream tasks, evaluating their robustness is crucial. In this paper, we propose a benchmark for assessing the robustness of vision-language models. We believe that a robust model should properly understand both linguistic and visual semantics and be resilient to explicit variations. In pursuit of this goal, we create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data. Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context while introducing noticeable pixel changes through image mixing techniques. Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models (e.g., Image-to-Text Recall@1: 81.9\% $\rightarrow$ 48.4\% in BLIP, 66.1\% $\rightarrow$ 37.6\% in VSE$\infty$), with the models often favoring the altered texts/images over the original ones. This indicates that current vision-language models struggle with subtle changes and often fail to understand the overall context of texts and images. Based on these findings, we propose semantic contrastive loss and visual contrastive loss to learn more robust embeddings. Datasets and code are available at {\url{https://github.com/pseulki/rococo}}.
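To make the perturbation recipe concrete, below is a minimal, illustrative sketch of the two alteration types the abstract describes: a caption whose meaning is flipped by swapping one word, and an image mixed with a second image so that some visual context is preserved while pixels change noticeably. It assumes PIL/NumPy and mixup-style pixel blending, and the function names and file names are hypothetical; it is not the authors' released implementation (see the linked repository for that).

```python
from PIL import Image
import numpy as np

def replace_word(caption: str, target: str, substitute: str) -> str:
    # Swap a single word so the caption's meaning changes
    # while the sentence otherwise stays identical.
    return " ".join(substitute if w == target else w for w in caption.split())

def mix_images(img_a: Image.Image, img_b: Image.Image, alpha: float = 0.5) -> Image.Image:
    # Blend two images pixel-wise (mixup-style blending is an assumption here),
    # keeping some visual context of img_a but changing many pixels.
    img_b = img_b.convert("RGB").resize(img_a.size)
    a = np.asarray(img_a.convert("RGB"), dtype=np.float32)
    b = np.asarray(img_b, dtype=np.float32)
    mixed = alpha * a + (1.0 - alpha) * b
    return Image.fromarray(mixed.astype(np.uint8))

# Hypothetical usage with made-up captions and file names:
# original = "a black dog runs on the beach"
# hard_negative_text = replace_word(original, "dog", "cat")
# hard_negative_image = mix_images(Image.open("coco_000001.jpg"),
#                                  Image.open("coco_000002.jpg"), alpha=0.6)
```

A robust image-text matching model should still score the original caption/image pair higher than either altered variant; the reported Recall@1 drops indicate that current models often do not.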
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes [64.57705752579207]
We evaluate the resilience of vision-based models against diverse object-to-background context variations.
We harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate object-to-background changes.
arXiv Detail & Related papers (2024-03-07T17:48:48Z) - Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z) - Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand $\textit{hypernymy}$, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z) - Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness [0.932065750652415]
TIDA (Targeted Image-editing Data Augmentation) is a targeted data augmentation method focused on improving models' human-like abilities.
We show that a TIDA-enhanced dataset related to gender, color, and counting abilities yields better performance on several image captioning metrics.
arXiv Detail & Related papers (2023-09-27T20:12:41Z) - COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe) tasks models with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best model variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.