Multi-Modal Hallucination Control by Visual Information Grounding
- URL: http://arxiv.org/abs/2403.14003v1
- Date: Wed, 20 Mar 2024 22:05:18 GMT
- Title: Multi-Modal Hallucination Control by Visual Information Grounding
- Authors: Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto,
- Abstract summary: We show that Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that are not always grounded in the input image.
We introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification.
M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt.
- Score: 121.6983694815504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that, however, are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination" and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without necessitating further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.
Related papers
- Mitigating Object Hallucination via Data Augmented Contrastive Tuning [52.43197107069751]
Multimodal Large Language Models (MLLMs) tend to hallucinate factually inaccurate information.
We introduce a contrastive tuning method that can be applied to a pretrained off-the-shelf MLLM for mitigating hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap [52.497823009176074]
We perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate.
To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations.
arXiv Detail & Related papers (2024-05-24T16:21:59Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [48.065569871444275]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback.
We generate a small-size hallucination annotation dataset by proprietary models.
Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - Aligning Modalities in Vision Large Language Models via Preference
Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z) - Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware
Direct Preference Optimization [45.53216822981202]
This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task.
When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities.
arXiv Detail & Related papers (2023-11-28T14:54:37Z) - Mitigating Hallucination in Visual Language Models with Visual
Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer from hallucination a lot.
Key problem lies in its weak ability to comprehend detailed content in a multi-modal context.
In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z) - Detecting and Preventing Hallucinations in Large Vision Language Models [4.7264116948935975]
M-HalDetect is the first multi-modal hallucination detection dataset for detailed image descriptions.
We train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling.
We find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively.
arXiv Detail & Related papers (2023-08-11T21:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.