The Instinctive Bias: Spurious Images lead to Illusion in MLLMs
- URL: http://arxiv.org/abs/2402.03757v2
- Date: Thu, 03 Oct 2024 03:40:20 GMT
- Title: The Instinctive Bias: Spurious Images lead to Illusion in MLLMs
- Authors: Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, Tong Zhang,
- Abstract summary: We identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
- Score: 34.91795817316696
- License:
- Abstract: Large language models (LLMs) have recently experienced remarkable progress, where the advent of multi-modal large language models (MLLMs) has endowed LLMs with visual capabilities, leading to impressive performances in various multi-modal tasks. However, those powerful MLLMs such as GPT-4V still fail spectacularly when presented with certain image and text inputs. In this paper, we identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers, causing MLLMs to suffer from visual illusion. To quantify the effect, we propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images. This benchmark contains 7,308 text-image pairs across 13 categories. Based on the proposed CorrelationQA, we conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees. We hope that our curated benchmark and evaluation results aid in better assessments of the MLLMs' robustness in the presence of misleading images. The code and datasets are available at https://github.com/MasaiahHan/CorrelationQA.
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - LIME: Less Is More for MLLM Evaluation [36.29820380945517]
We propose LIME (Less Is More for MLLM Evaluation), a benchmark curated through a semi-automated pipeline.
This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding.
Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models.
arXiv Detail & Related papers (2024-09-10T20:19:14Z) - MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC)
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted this way.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z) - Investigating the Catastrophic Forgetting in Multimodal Large Language
Models [43.89009178021342]
We introduce EMT: evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
arXiv Detail & Related papers (2023-09-19T04:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.