Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model
- URL: http://arxiv.org/abs/2312.06968v4
- Date: Sat, 24 Feb 2024 03:34:59 GMT
- Title: Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model
- Authors: Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming
Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang
- Abstract summary: Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
- Score: 53.65682783591723
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal large language models (MLLMs) have been shown to efficiently
integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where
they tend to generate erroneous or fabricated information. In this paper, we
address hallucinations in MLLMs from a novel perspective of representation
learning. We first analyze the representation distribution of textual and
visual tokens in MLLMs, revealing two important findings: 1) there is a
significant gap between textual and visual representations, indicating
unsatisfactory cross-modal representation alignment; 2) representations of
texts that contain and do not contain hallucinations are entangled, making it
challenging to distinguish them. These two observations inspire us with a
simple yet effective method to mitigate hallucinations. Specifically, we
introduce contrastive learning into MLLMs and use text with hallucination as
hard negative examples, naturally bringing representations of non-hallucinative
text and visual samples closer while pushing away representations of
non-hallucinative and hallucinative text. We evaluate our method quantitatively
and qualitatively, showing its effectiveness in reducing hallucination
occurrences and improving performance across multiple benchmarks. On the
MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the
baseline MiniGPT-4/LLaVA. Our code is available at
https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.
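To make the core idea concrete, the sketch below shows one common way to fold hallucinated captions into an image-to-text InfoNCE objective as extra hard negatives. It is not the authors' released implementation; the function name, tensor shapes, and temperature value are illustrative assumptions, and the repository linked above remains the authoritative reference.

```python
import torch
import torch.nn.functional as F

def hallucination_augmented_contrastive_loss(image_emb, text_emb, hall_emb,
                                              temperature=0.07):
    """Image-to-text InfoNCE loss with hallucinated captions as hard negatives.

    image_emb: (B, D) visual representations
    text_emb:  (B, D) faithful-caption representations (positives)
    hall_emb:  (B, D) hallucinated-caption representations (hard negatives)
    """
    # Cosine similarity via L2-normalized embeddings
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    hall_emb = F.normalize(hall_emb, dim=-1)

    # Similarities to all in-batch faithful captions (positives on the diagonal)
    logits_pos = image_emb @ text_emb.t() / temperature    # (B, B)
    # Similarities to hallucinated captions, used purely as negatives
    logits_hard = image_emb @ hall_emb.t() / temperature   # (B, B)

    logits = torch.cat([logits_pos, logits_hard], dim=1)   # (B, 2B)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy pulls each image toward its faithful caption and pushes it
    # away from all other captions, including the hallucinated ones.
    return F.cross_entropy(logits, targets)
```

In this formulation the hallucinated captions only ever appear in the softmax denominator, so training draws visual and non-hallucinative text representations together while explicitly separating them from hallucinative text, mirroring the two findings above.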
Related papers
- HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
However, these models still suffer from multimodal hallucination, i.e., generating objects or content that contradicts the images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z)
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
- Mitigating Multilingual Hallucination in Large Vision-Language Models [35.75851356840673]
We propose a two-stage Multilingual Hallucination Removal (MHR) framework for Large Vision-Language Models (LVLMs).
Instead of relying on the intricate manual annotations of multilingual resources, we propose a novel cross-lingual alignment method.
Our framework delivers an average increase of 19.0% in accuracy across 13 different languages.
arXiv Detail & Related papers (2024-08-01T13:34:35Z)
- MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
- Data-augmented phrase-level alignment for mitigating object hallucination [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z)
- Hallucination Diversity-Aware Active Learning for Text Summarization [46.00645048690819]
Large Language Models (LLMs) have shown propensity to generate hallucinated outputs, i.e., texts that are factually incorrect or unsupported.
Existing methods for alleviating hallucinations typically require costly human annotations to identify and correct hallucinations in LLM outputs.
We propose the first active learning framework to alleviate LLM hallucinations, reducing the costly human annotation of hallucinations that would otherwise be needed.
arXiv Detail & Related papers (2024-04-02T02:30:27Z)
- Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination [14.25488878224697]
We propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics.
Pensieve mitigates the effects of errors from both the visual and textual branches by adaptively scaling the subtracted scores.
arXiv Detail & Related papers (2024-03-21T13:49:42Z)
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
However, hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method successfully mitigates 44.6% of hallucinations in relative terms and maintains competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)