Related papers: DARE: Diverse Visual Question Answering with Robustness Evaluation

URL: http://arxiv.org/abs/2409.18023v1
Date: Thu, 26 Sep 2024 16:31:50 GMT
Title: DARE: Diverse Visual Question Answering with Robustness Evaluation
Authors: Hannah Sterz, Jonas Pfeiffer, Ivan Vulić
Abstract summary: Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models. They struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. We introduce DARE, Diverse Visual Question Answering with Robustness Evaluation.
Score: 16.87867803628065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) extend remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they might be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on the variations of: prompts, the subsets of answer options, the output format and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of the open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match the closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.
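The gap between standard and worst-case performance mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a question counts toward the worst case only if it is answered correctly under every variation; the abstract does not spell out the exact definition, so this aggregation rule, the function names, and the toy data are all hypothetical.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def worst_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Accuracy when a question counts only if it is correct under every variation.

    Assumption: this 'correct everywhere' rule is one plausible reading of
    worst-case performance, not necessarily the benchmark's own definition.
    """
    per_question = zip(*results.values())
    solved_everywhere = [all(outcomes) for outcomes in per_question]
    return accuracy(solved_everywhere)


# Hypothetical toy data: 4 questions evaluated under 3 answer-option variations.
results = {
    "standard": [True, True, True, False],
    "subset_a": [True, False, True, False],
    "subset_b": [True, True, False, False],
}

standard = accuracy(results["standard"])
worst = worst_case_accuracy(results)
gap = standard - worst

print(f"standard={standard:.2f} worst_case={worst:.2f} gap={gap:.2f}")
```

In this toy example only the first question is solved under all three variations, so the worst-case accuracy falls well below the standard accuracy even though each individual variation looks moderately good on its own.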

Related papers

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models [17.522361689805724]
Vision language models (VLMs) respond to user-crafted text prompts and visual inputs. It is crucial to determine whether VLMs inherit this instability to varying prompts. We introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework.
arXiv Detail & Related papers (2025-06-03T19:42:32Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
V-MAGE is a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
How Well Can Vison-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark [0.8820880683910832]
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities in Visual Question Answering (VQA) tasks. However, their ability to perform Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and mental states, remains underexplored. We propose an open-ended question framework to evaluate VLMs' performance across diverse categories of ToM tasks.
arXiv Detail & Related papers (2025-03-28T02:26:32Z)
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering [28.045285777736876]
We introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. It enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness.
arXiv Detail & Related papers (2025-03-09T07:25:32Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models. We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. We introduce AutoBench-V, an automated framework for serving evaluation on demand. Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types [0.9217021281095907]
Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments.
arXiv Detail & Related papers (2024-09-14T02:29:36Z)
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which includes a new capability called "image-text sequence understanding". Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z)
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models [34.843603169616486]
We introduce AlignMMBench, a comprehensive alignment benchmark for emerging Chinese Vision-Language Models (VLMs). This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability.
arXiv Detail & Related papers (2024-06-13T16:30:14Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [55.91371032213854]
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
arXiv Detail & Related papers (2023-11-27T18:59:42Z)
MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models. MMBench is meticulously curated with well-designed quality control schemes. MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.