A Benchmark for Multi-modal Foundation Models on Low-level Vision: from
Single Images to Pairs
- URL: http://arxiv.org/abs/2402.07116v1
- Date: Sun, 11 Feb 2024 06:44:11 GMT
- Title: A Benchmark for Multi-modal Foundation Models on Low-level Vision: from
Single Images to Pairs
- Authors: Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, Weisi Lin
- Abstract summary: We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single images, as humans do.
- Score: 76.24832641793621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has
driven a paradigm shift in computer vision, moving towards versatile
foundational models. However, evaluating MLLMs on low-level visual perception
and understanding remains an underexplored domain. To this end, we design
benchmark settings to emulate human language responses related to low-level
vision: the low-level visual perception (A1) via visual question answering
related to low-level attributes (e.g. clarity, lighting); and the low-level
visual description (A2), on evaluating MLLMs for low-level text descriptions.
Furthermore, given that pairwise comparison can better avoid ambiguity in
responses and is widely adopted in human studies, we further extend the
low-level perception-related question-answering and description evaluations of
MLLMs from single images to image pairs. Specifically, for perception (A1), we
construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999
image pairs each accompanied by an open-ended question about its low-level
features; for description (A2), we propose the LLDescribe+ dataset, evaluating
MLLMs for low-level descriptions on 499 single images and 450 pairs.
Additionally, we evaluate the assessment (A3) ability of MLLMs, i.e., predicting
quality scores, by employing a softmax-based approach that enables all MLLMs to generate
quantifiable quality ratings, tested against human opinions in 7 image quality
assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that
several MLLMs have decent low-level visual competencies on single images, but
only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image
evaluations, as humans do. We hope that our benchmark will motivate further
research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at https://github.com/Q-Future/Q-Bench.
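The softmax-based assessment (A3) strategy mentioned in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example assuming a HuggingFace-style multimodal causal LM whose tokenizer maps the anchor words "good" and "poor" to single tokens; the function name, prompt template, and pixel_values argument are illustrative assumptions, not the paper's exact implementation.

import torch

@torch.no_grad()
def softmax_quality_score(model, tokenizer, pixel_values,
                          prompt="Rate the quality of the image."):
    """Return a scalar quality rating in [0, 1] as P("good") vs. P("poor")."""
    # Token ids for the two anchor words (assumed to be single tokens).
    good_id = tokenizer(" good", add_special_tokens=False).input_ids[-1]
    poor_id = tokenizer(" poor", add_special_tokens=False).input_ids[-1]

    # Text prompt plus a completion stub; the image is passed separately.
    inputs = tokenizer(prompt + " The quality of the image is",
                       return_tensors="pt")

    # Forward pass; logits for the next token sit at the last position.
    # The exact multimodal forward signature differs across MLLMs.
    outputs = model(input_ids=inputs.input_ids, pixel_values=pixel_values)
    next_token_logits = outputs.logits[0, -1]

    # Softmax restricted to the two anchor tokens yields a quantifiable rating.
    two_logits = next_token_logits[[good_id, poor_id]]
    return torch.softmax(two_logits, dim=-1)[0].item()

Per-image scores produced this way can then be compared against human mean opinion scores in the IQA datasets, presumably via standard correlation metrics such as SRCC/PLCC.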
Related papers
- Finer: Investigating and Enhancing Fine-Grained Visual Concept
Recognition in Large Vision Language Models [68.46457611340097]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: they respond inconsistently when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs [36.42188183017291]
We identify a typical class of inputs that baffles MLLMs: images that are highly relevant to, but inconsistent with, the answers.
To quantify the effect, we propose CorrelationQA, the first benchmark that assesses the hallucination level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information in the given image sequences, often leading to hallucinations or misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level
Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)