The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
- URL: http://arxiv.org/abs/2402.03757v1
- Date: Tue, 6 Feb 2024 06:48:46 GMT
- Title: The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs
- Authors: Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe
Diao, Yong Lin, Tong Zhang
- Abstract summary: We identify a typical class of inputs that baffles MLLMs: images that are highly relevant to a question but inconsistent with the correct answer.
To quantify the effect, we propose CorrelationQA, the first benchmark that assesses the hallucination level given spurious images.
We conduct a thorough analysis of 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
- Score: 36.42188183017291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently experienced remarkable progress,
where the advent of multi-modal large language models (MLLMs) has endowed LLMs
with visual capabilities, leading to impressive performances in various
multi-modal tasks. However, even powerful MLLMs such as GPT-4V still fail
spectacularly when presented with certain image and text inputs. In this paper,
we identify a typical class of inputs that baffles MLLMs: images that are
highly relevant to a question but inconsistent with the correct answer, causing
MLLMs to hallucinate. To quantify the effect, we propose CorrelationQA,
the first benchmark that assesses the hallucination level given spurious
images. This benchmark contains 7,308 text-image pairs across 13 categories.
Based on the proposed CorrelationQA, we conduct a thorough analysis of 9
mainstream MLLMs, illustrating that they universally suffer from this
instinctive bias to varying degrees. We hope that our curated benchmark and
evaluation results aid in better assessments of MLLMs' robustness in the
presence of misleading images. The resource is available at
https://github.com/MasaiahHan/CorrelationQA.
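To make the evaluation protocol concrete, below is a minimal sketch of how one might measure the accuracy drop caused by spurious images on a CorrelationQA-style dataset. The file name `data.json`, the field names `question`/`image`/`answer`, and the `query_mllm` stub are all assumptions for illustration; the actual data layout and evaluation code live in the repository linked above.
```python
import json
from pathlib import Path

# All file names and record fields below are assumptions for illustration;
# the actual layout of the CorrelationQA repository may differ.
DATA_DIR = Path("CorrelationQA")


def query_mllm(question: str, image_path: str | None) -> str:
    """Placeholder for any MLLM call (API or local model).

    Should return the model's free-form answer as a string.
    """
    raise NotImplementedError("Plug in your model of choice here.")


def accuracy(pairs: list[dict], use_image: bool) -> float:
    """Fraction of replies that contain the gold answer string."""
    correct = 0
    for ex in pairs:
        image = ex["image"] if use_image else None
        reply = query_mllm(ex["question"], image)
        correct += ex["answer"].lower() in reply.lower()
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = json.loads((DATA_DIR / "data.json").read_text())
    acc_text = accuracy(pairs, use_image=False)      # text-only baseline
    acc_spurious = accuracy(pairs, use_image=True)   # with spurious images
    # A large drop from acc_text to acc_spurious suggests the model is
    # misled by the spurious image rather than answering from its knowledge.
    print(f"text-only: {acc_text:.3f}, with spurious image: {acc_spurious:.3f}")
```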
Related papers
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
- A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs [76.24832641793621]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
- MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark [41.68821233828375]
This paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities.
Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking.
arXiv Detail & Related papers (2024-02-07T12:28:32Z)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
- MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
These models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation [58.19101663976327]
Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations.
Evaluating MLLMs' hallucinations is becoming increasingly important for model improvement and practical application deployment.
We propose AMBER, an LLM-free multi-dimensional benchmark that can be used to evaluate both generative and discriminative tasks.
arXiv Detail & Related papers (2023-11-13T15:25:42Z)
- Investigating the Catastrophic Forgetting in Multimodal Large Language Models [43.89009178021342]
We introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
arXiv Detail & Related papers (2023-09-19T04:51:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.