The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
- URL: http://arxiv.org/abs/2410.12787v1
- Date: Wed, 16 Oct 2024 17:59:02 GMT
- Title: The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
- Authors: Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing,
- Abstract summary: This paper presents the first systematic investigation of hallucinations in large multimodal models (LMMs)
Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning.
- Score: 118.75449542080746
- License:
- Abstract: Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias inVidQA benchmarks and datasets.
We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.
Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - Hallucination of Multimodal Large Language Models: A Survey [40.73148186369018]
multimodal large language models (MLLMs) have demonstrated significant advancements and remarkable abilities in multimodal tasks.
Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content.
This survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
arXiv Detail & Related papers (2024-04-29T17:59:41Z) - Examining Modality Incongruity in Multimodal Federated Learning for
Medical Vision and Language-based Disease Detection [7.515840210206994]
The impact of missing modality in different clients, also called modality incongruity, has been greatly overlooked.
This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients.
arXiv Detail & Related papers (2024-02-07T22:16:53Z) - Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following.
We explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z) - Siren's Song in the AI Ocean: A Survey on Hallucination in Large
Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.