Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs
- URL: http://arxiv.org/abs/2406.18849v4
- Date: Mon, 24 Feb 2025 01:56:43 GMT
- Title: Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs
- Authors: Jie Zhang, Zhongqi Wang, Mengqi Lei, Zheng Yuan, Bei Yan, Shiguang Shan, Xilin Chen
- Abstract summary: Dysca is a dynamic and scalable benchmark for evaluating LVLMs by leveraging synthesized images. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Dysca serves as a scalable benchmark for easily adding new subtasks and scenarios.
- Score: 61.01278660925202
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct their questions from images drawn from existing datasets, resulting in potential data leakage. Moreover, these benchmarks focus on evaluating LVLMs on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choice, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 24 advanced open-source LVLMs and 2 closed-source LVLMs are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Robin-WZQ/Dysca.
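To make the generative paradigm concrete, below is a minimal sketch of how a rule-based image/question/answer generator built around Stable Diffusion might look, assuming the Hugging Face `diffusers` library. The style and subject lists, prompt template, question, and model id are illustrative placeholders, not Dysca's actual rules or configuration.

```python
# Minimal sketch: rule-based (prompt, question, answer) construction followed by
# image synthesis with Stable Diffusion. All lists/templates are illustrative.
import random
from diffusers import StableDiffusionPipeline

STYLES = ["oil painting", "pixel art", "watercolor"]   # placeholder; Dysca covers 51 styles
SUBJECTS = ["cat", "dog", "horse"]                     # placeholder attribute axis

def make_sample(seed: int):
    """Rule-based construction of a (prompt, question, options, answer) tuple."""
    rng = random.Random(seed)
    style = rng.choice(STYLES)
    subject = rng.choice(SUBJECTS)
    prompt = f"a {subject}, in the style of {style}"
    question = "What animal appears in the image?"     # multi-choice question type
    return prompt, question, SUBJECTS, subject

# Synthesize the image for the sampled prompt and pair it with the generated Q&A.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
prompt, question, options, answer = make_sample(seed=0)
image = pipe(prompt).images[0]
image.save("dysca_style_sample.png")
print(question, "| options:", options, "| ground truth:", answer)
```

Because the answer is fixed before the image is generated, new subtasks or styles only require extending the attribute lists and templates, which is what makes this kind of benchmark dynamic and scalable.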
Related papers
- More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z) - Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models [4.499940819352075]
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach.
arXiv Detail & Related papers (2025-10-04T18:56:41Z) - VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes [36.370533774426555]
We present VisualOverload, a visual question answering (VQA) benchmark comprising 2,720 question-answer pairs. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated scenes. We observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy overall.
arXiv Detail & Related papers (2025-09-29T18:00:25Z) - BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics. We use procedural generation of synthetic images to obtain control over visual attributes. This diagnostic allows systematic stress testing and fine-grained failure analysis.
arXiv Detail & Related papers (2025-06-05T12:43:10Z) - TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images [1.8668361563848481]
TDBench is a comprehensive benchmark for Vision-Language Models (VLMs) in top-down image understanding.
It consists of visual question-answer pairs across ten evaluation dimensions of image understanding.
We conduct four case studies on scenarios that commonly occur in the real world but remain underexplored.
arXiv Detail & Related papers (2025-04-01T19:01:13Z) - Are Large Vision Language Models Good Game Players? [25.49713745405194]
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information.
Existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering, often fail to capture the full scope of LVLMs' capabilities.
We propose a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments.
arXiv Detail & Related papers (2025-03-04T07:29:03Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-demanded inputs, the framework demonstrates its effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z) - DARE: Diverse Visual Question Answering with Robustness Evaluation [16.87867803628065]
Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models.
However, they struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning.
We introduce DARE, Diverse Visual Question Answering with Robustness Evaluation.
arXiv Detail & Related papers (2024-09-26T16:31:50Z) - MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z) - Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z) - Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling [4.855663359344748]
We introduce a new benchmark for Few-Shot Image Classification using the Danish Fungi 2020 dataset.
This benchmark proposes a wide variety of evaluation tasks at various levels of granularity.
Our experiments bring out the correlation between the difficulty of a task and the semantic similarity between its classes.
arXiv Detail & Related papers (2022-05-10T20:25:43Z)