SugarCrepe: Fixing Hackable Benchmarks for Vision-Language
Compositionality
- URL: http://arxiv.org/abs/2306.14610v1
- Date: Mon, 26 Jun 2023 11:35:22 GMT
- Title: SugarCrepe: Fixing Hackable Benchmarks for Vision-Language
Compositionality
- Authors: Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, Ranjay
Krishna
- Abstract summary: We introduce SugarCrepe, a new benchmark for vision-language compositionality evaluation.
We employ large language models, instead of rule-based templates, to generate fluent and sensical hard negatives.
We re-evaluate state-of-the-art models and recently proposed compositionality inducing strategies, and find that their improvements were hugely overestimated.
- Score: 26.61030477161824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models has permeated the machine learning
ecosystem. Given an image, these benchmarks probe a model's ability to identify
its associated caption amongst a set of compositional distractors.
Surprisingly, we find significant biases in all these benchmarks rendering them
hackable. This hackability is so dire that blind models with no access to the
image outperform state-of-the-art vision-language models. To remedy this
rampant vulnerability, we introduce SugarCrepe, a new benchmark for
vision-language compositionality evaluation. We employ large language models,
instead of rule-based templates used in previous benchmarks, to generate fluent
and sensical hard negatives, and utilize an adversarial refinement mechanism to
maximally reduce biases. We re-evaluate state-of-the-art models and recently
proposed compositionality inducing strategies, and find that their improvements
were hugely overestimated, suggesting that more innovation is needed in this
important direction. We release SugarCrepe and the code for evaluation at:
https://github.com/RAIVNLab/sugar-crepe.
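The evaluation protocol described above reduces to a binary image-to-text choice: for each image, the model must score the original caption above an LLM-generated hard negative. Below is a minimal sketch of that protocol, assuming a Hugging Face CLIP checkpoint as a stand-in scorer; the dataset field names are illustrative placeholders, not SugarCrepe's actual schema.

```python
# Minimal sketch (not the official SugarCrepe harness): score a CLIP-style
# model on positive-vs-hard-negative caption pairs for each image.
# The fields "image_path", "positive", "hard_negative" are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def pair_accuracy(examples):
    """Fraction of examples where the true caption outscores the hard negative."""
    correct = 0
    for ex in examples:
        image = Image.open(ex["image_path"]).convert("RGB")
        inputs = processor(
            text=[ex["positive"], ex["hard_negative"]],
            images=image,
            return_tensors="pt",
            padding=True,
        )
        with torch.no_grad():
            logits = model(**inputs).logits_per_image[0]  # shape: (2,)
        correct += int(logits[0] > logits[1])
    return correct / len(examples)
```

A "blind" baseline for the hackability check would replace the image-conditioned logits with a text-only plausibility score (e.g. from a language model); on an unbiased benchmark such a baseline should sit near chance.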
Related papers
- A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks [32.052113371887124]
We investigate 17 benchmarks commonly used for measuring the compositional understanding capabilities of vision-language models.
We scrutinize design choices in their construction, including data sources and curation procedures.
We find that blind heuristics perform on par with CLIP models, indicating that these benchmarks do not effectively measure compositional understanding.
arXiv Detail & Related papers (2025-06-09T20:53:43Z)
- Boosting Alignment for Post-Unlearning Text-to-Image Generative Models [55.82190434534429]
Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data.
This often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns.
We propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives.
arXiv Detail & Related papers (2024-12-09T21:36:10Z)
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models [33.80107512462935]
Foundational Vision-Language Models (VLMs) excel at object-centric recognition yet learn text representations that seem invariant to word order.
No evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully.
In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language.
arXiv Detail & Related papers (2024-02-22T23:42:25Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Scalable Performance Analysis for Vision-Language Models [26.45624201546282]
Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs.
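The bag-of-words finding can be illustrated with a simple word-order probe: if shuffling a caption's words barely moves its text embedding, the encoder cannot separate compositional rearrangements. A minimal sketch follows, assuming an off-the-shelf CLIP checkpoint rather than the paper's analysis pipeline.

```python
# Minimal sketch (not the paper's analysis pipeline): measure how much a CLIP
# text embedding changes when a caption's word order is shuffled.
# A cosine similarity near 1.0 indicates bag-of-words-like behaviour.
import random
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

caption = "a dog chasing a cat on the grass"
words = caption.split()
random.shuffle(words)
shuffled = " ".join(words)

orig, perm = embed([caption, shuffled])
print("cosine(original, shuffled) =", float(orig @ perm))
```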
arXiv Detail & Related papers (2023-05-30T06:40:08Z)
- RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
arXiv Detail & Related papers (2023-04-21T03:45:59Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
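Projecting out a biased direction is, at its core, an orthogonal projection in the text-embedding space. The sketch below illustrates only that underlying operation; the paper's calibrated projection matrix is more involved, and the helper names here are hypothetical.

```python
# Minimal sketch: remove an estimated bias direction from text embeddings by
# orthogonal projection. This is the basic idea the calibrated projection
# matrix in the paper builds on, not its exact construction.
import numpy as np

def bias_direction(embeddings_a: np.ndarray, embeddings_b: np.ndarray) -> np.ndarray:
    """Estimate a bias direction from two groups of prompt embeddings
    (e.g. prompts mentioning different protected attributes)."""
    d = embeddings_a.mean(axis=0) - embeddings_b.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(embeddings: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each embedding along the bias direction,
    then re-normalize for cosine-similarity classifiers."""
    dim = embeddings.shape[-1]
    P = np.eye(dim) - np.outer(direction, direction)  # projection matrix
    debiased = embeddings @ P.T
    return debiased / np.linalg.norm(debiased, axis=-1, keepdims=True)
```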
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Estimating the Robustness of Classification Models by the Structure of the Learned Feature-Space [10.418647759223964]
We argue that fixed test sets are only able to capture a small portion of possible data variations and are thus limited and prone to generating new overfitted solutions.
To overcome these drawbacks, we suggest to estimate the robustness of a model directly from the structure of its learned feature-space.
arXiv Detail & Related papers (2021-06-23T10:52:29Z)
- Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images of the letters as they appear when printed.
We name this new strategy visual embedding; it is expected to improve the robustness of NLP models.
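One plausible reading of this visual embedding is to rasterize each character with a font and use the flattened glyph bitmap as its representation, so visually similar symbols receive similar vectors. The sketch below follows that reading and is an assumption about the general idea, not the paper's exact rendering setup.

```python
# Minimal sketch (assumed setup, not the paper's): represent each character by
# the pixel values of its rendered glyph, so visually similar characters
# (e.g. 'O' and '0') end up with similar embeddings.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_embedding(ch: str, size: int = 16) -> np.ndarray:
    """Render one character to a size x size grayscale bitmap and flatten it."""
    img = Image.new("L", (size, size), color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), ch, fill=0, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

emb = np.stack([glyph_embedding(c) for c in "hello"])
print(emb.shape)  # (5, 256)
```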
arXiv Detail & Related papers (2020-10-20T04:08:03Z)
- RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
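In practice, this standardized evaluation amounts to loading a model from the benchmark's model zoo and measuring its accuracy under AutoAttack. A minimal sketch follows; the robustbench and autoattack package calls are quoted from memory and should be checked against the repository documentation.

```python
# Minimal sketch of the standardized evaluation: load a model from the
# RobustBench model zoo and measure robust accuracy under AutoAttack.
# Package names/signatures are assumptions to verify against the repos.
import torch
from robustbench.data import load_cifar10
from robustbench.utils import load_model
from autoattack import AutoAttack

model = load_model(model_name="Standard", dataset="cifar10", threat_model="Linf")
model.eval()

x_test, y_test = load_cifar10(n_examples=128)

adversary = AutoAttack(model, norm="Linf", eps=8 / 255, version="standard")
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=64)

# Robust accuracy = accuracy on the adversarially perturbed inputs.
with torch.no_grad():
    robust_acc = (model(x_adv).argmax(dim=1) == y_test).float().mean().item()
print(f"robust accuracy: {robust_acc:.3f}")
```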
arXiv Detail & Related papers (2020-10-19T17:06:18Z)