Holistic Analysis of Hallucination in GPT-4V(ision): Bias and
Interference Challenges
- URL: http://arxiv.org/abs/2311.03287v2
- Date: Tue, 7 Nov 2023 02:18:48 GMT
- Title: Holistic Analysis of Hallucination in GPT-4V(ision): Bias and
Interference Challenges
- Authors: Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James
Zou, Huaxiu Yao
- Abstract summary: This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to an imbalance in its training data.
Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented.
- Score: 54.42256219010956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While GPT-4V(ision) impressively models both visual and textual information
simultaneously, its hallucination behavior has not been systematically
assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias
and Interference Challenges in Visual Language Models (Bingo). This benchmark
is designed to evaluate and shed light on the two common types of
hallucinations in visual language models: bias and interference. Here, bias
refers to the model's tendency to hallucinate certain types of responses,
possibly due to an imbalance in its training data. Interference pertains to
scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the
text prompt is phrased or how the input image is presented. We identify a
notable regional bias, whereby GPT-4V(ision) is better at interpreting Western
images or images with English writing compared to images from other countries
or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to
leading questions and is often confused when interpreting multiple images
together. Popular mitigation approaches, such as self-correction and
chain-of-thought reasoning, are not effective in resolving these challenges. We
also identify similar biases and interference vulnerabilities in LLaVA and
Bard. Our results characterize the hallucination challenges in GPT-4V(ision)
and state-of-the-art visual-language models, and highlight the need for new
solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.
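To make the interference probes concrete, here is a minimal Python sketch that queries GPT-4V(ision) through the OpenAI chat completions client with a neutral and a leading phrasing of the same question; the model name, image URL, and prompts are illustrative assumptions, not items drawn from Bingo.

```python
# Minimal sketch of a Bingo-style interference probe: ask GPT-4V the same
# question about an image twice, once neutrally and once with a leading
# phrasing, and compare the answers. The model name, image URL, and prompts
# are illustrative assumptions, not part of the Bingo benchmark itself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # the GPT-4V(ision) endpoint at the time
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=200,
    )
    return response.choices[0].message.content

image = "https://example.com/clock.jpg"  # placeholder image URL
neutral = ask("What time does the clock show?", image)
leading = ask("The clock shows 3 o'clock, right?", image)  # leading question
print("neutral:", neutral)
print("leading:", leading)  # a flipped answer here signals interference
```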
Related papers
- A Unified Hallucination Mitigation Framework for Large Vision-Language Models [18.595958586621943]
We present a unified framework, Dentist, for hallucination mitigation.
The core step is to first classify queries and then apply a different hallucination-mitigation process depending on the classification result.
On MMBench, we achieve 13.44%/10.2%/15.8% accuracy improvements on Image Quality.
arXiv Detail & Related papers (2024-09-24T22:36:58Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generating factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
- Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams [14.801853435122908]
We present a framework for evaluating language models on entrance exams that incorporates both textual and visual elements.
We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement.
arXiv Detail & Related papers (2023-11-23T19:20:59Z)
- An Early Evaluation of GPT-4V(ision) [40.866323649060696]
We evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
To estimate its performance, we manually construct 656 test instances and carefully evaluate GPT-4V's results.
arXiv Detail & Related papers (2023-10-25T10:33:17Z)
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models [69.52245481329899]
We introduce HallusionBench, a benchmark for the evaluation of image-context reasoning.
The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts.
In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V.
arXiv Detail & Related papers (2023-10-23T04:49:09Z)
- Evaluating Hallucinations in Chinese Large Language Models [65.4771562909392]
We establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models.
We consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT.
For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated.
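A judging step of this kind can be sketched in a few lines of Python; the prompt wording and model name below are our assumptions, since the summary does not reproduce HalluQA's actual judging prompt.

```python
# Minimal sketch of GPT-4-as-judge hallucination scoring, in the spirit of
# HalluQA's automated evaluation. The judging prompt and model name are
# assumptions; the benchmark's actual prompt may differ.
from openai import OpenAI

client = OpenAI()

def is_hallucinated(question: str, reference: str, answer: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": (
                "You are grading a model answer for hallucination.\n"
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Model answer: {answer}\n"
                "Reply with exactly one word: YES if the model answer "
                "contradicts the reference or adds unsupported facts, "
                "otherwise NO."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")

# Hypothetical usage:
print(is_hallucinated("Who wrote 'Dream of the Red Chamber'?",
                      "Cao Xueqin", "Lu Xun wrote it."))  # expected: True
```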
arXiv Detail & Related papers (2023-10-05T07:57:09Z)
- VALHALLA: Visual Hallucination for Machine Translation [64.86515924691899]
We introduce a visual hallucination framework, called VALHALLA.
It requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation.
In particular, given a source sentence, an autoregressive hallucination transformer predicts a discrete visual representation from the input text, as sketched after this entry.
arXiv Detail & Related papers (2022-05-31T20:25:15Z)
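To make that mechanism concrete, the toy PyTorch sketch below decodes discrete visual tokens (e.g., VQ-codebook indices) autoregressively from source-text tokens alone; all sizes, token ids, and names are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of VALHALLA's core idea (not the authors' code): an
# autoregressive "hallucination transformer" decodes discrete visual tokens
# from source-text tokens alone, so no image is needed at inference time.
# All sizes, token ids, and names are illustrative assumptions.
import torch
import torch.nn as nn

class HallucinationTransformer(nn.Module):
    def __init__(self, text_vocab=32000, visual_vocab=8192,
                 d_model=256, nhead=4, layers=3, max_visual_len=16):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.vis_emb = nn.Embedding(visual_vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.head = nn.Linear(d_model, visual_vocab)
        self.max_visual_len = max_visual_len

    @torch.no_grad()
    def hallucinate(self, src_ids: torch.Tensor) -> torch.Tensor:
        """Greedily decode visual token ids from text token ids (batch, seq)."""
        memory = self.encoder(self.text_emb(src_ids))
        vis = torch.zeros(src_ids.size(0), 1, dtype=torch.long,
                          device=src_ids.device)  # token id 0 stands in for BOS
        for _ in range(self.max_visual_len):
            h = self.decoder(self.vis_emb(vis), memory)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            vis = torch.cat([vis, next_tok], dim=1)
        return vis[:, 1:]  # codebook indices standing in for an image

model = HallucinationTransformer()
tokens = model.hallucinate(torch.randint(0, 32000, (1, 7)))  # dummy source ids
print(tokens.shape)  # torch.Size([1, 16])
```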