Mitigating Hallucination in Visual Language Models with Visual
Supervision
- URL: http://arxiv.org/abs/2311.16479v1
- Date: Mon, 27 Nov 2023 09:30:02 GMT
- Title: Mitigating Hallucination in Visual Language Models with Visual
Supervision
- Authors: Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao,
Jinqiao Wang, Ming Tang
- Abstract summary: Large vision-language models (LVLMs) suffer heavily from hallucination.
The key problem lies in their weak ability to comprehend detailed content in a multi-modal context.
In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
- Score: 33.05550629039951
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large vision-language models (LVLMs) suffer heavily from
hallucination, occasionally generating responses that plainly contradict the
image content. The key problem lies in their weak ability to comprehend
detailed content in a multi-modal context, which can mainly be attributed to
two factors: the training data and the loss function. The vision instruction
dataset primarily focuses on global descriptions, and the auto-regressive loss
function favors text modeling over image understanding. In this paper, we bring
more detailed vision annotations and more discriminative vision models into the
training of LVLMs, so that they can generate more precise responses without
encountering hallucination. On one hand, we generate image-text pairs with
detailed relationship annotations from the panoptic scene graph dataset (PSG).
These conversations pay more attention to detailed facts in the image,
encouraging the model to answer questions based on the multi-modal context. On
the other hand, we integrate SAM and a mask prediction loss as auxiliary
supervision, forcing the LVLMs to identify context-related objects so that they
can generate more accurate responses and mitigate hallucination. Moreover, to
provide a deeper evaluation of hallucination in LVLMs, we propose a new
benchmark, RAH-Bench. It divides vision hallucination into three types that
contradict the image through wrong categories, attributes, or relations, and
introduces False Positive Rate as a detailed sub-metric for each type. On this
benchmark, our approach demonstrates a +8.4% improvement over the original
LLaVA and achieves widespread performance improvements across other models.
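The auxiliary supervision described above can be pictured as a standard auto-regressive loss combined with a mask-prediction loss computed against SAM-generated masks. The sketch below only illustrates that idea under assumed interfaces: the `mask_head` module, the batch fields, and the weight `lambda_mask` are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def training_step(lvlm, mask_head, batch, lambda_mask=1.0):
    # Hypothetical joint objective: auto-regressive loss on the answer tokens
    # plus an auxiliary mask-prediction loss against masks produced offline by SAM.
    outputs = lvlm(images=batch["images"],
                   input_ids=batch["input_ids"],
                   labels=batch["labels"],
                   output_hidden_states=True)
    lm_loss = outputs.loss                        # text-modeling term

    # Predict masks for context-related objects from the LVLM's hidden states;
    # `mask_head` and `target_masks` (SAM outputs) are illustrative assumptions.
    pred_masks = mask_head(outputs.hidden_states[-1], batch["image_features"])
    mask_loss = F.binary_cross_entropy_with_logits(pred_masks,
                                                   batch["target_masks"])

    return lm_loss + lambda_mask * mask_loss      # combined training loss
```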
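RAH-Bench's False Positive Rate sub-metric can be read as the fraction of misleading questions (whose correct answer is "no") that the model nevertheless answers "yes", computed separately for the category, attribute, and relation hallucination types. A minimal sketch under that reading, with hypothetical field names:

```python
from collections import defaultdict

def false_positive_rate(results):
    # results: list of dicts such as
    # {"type": "category" | "attribute" | "relation", "model_answer": "yes"/"no"},
    # one per misleading question whose ground-truth answer is "no".
    yes, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["type"]] += 1
        if r["model_answer"].strip().lower().startswith("yes"):
            yes[r["type"]] += 1
    # Per-type FPR: share of misleading questions wrongly answered "yes".
    return {t: yes[t] / total[t] for t in total}

# Example: two category questions, one answered "yes" -> FPR 0.5 for that type.
print(false_positive_rate([
    {"type": "category", "model_answer": "Yes, there is a dog."},
    {"type": "category", "model_answer": "No."},
]))
```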
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization [21.248617886995103]
We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time.
Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context.
arXiv Detail & Related papers (2024-11-05T01:24:37Z) - HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
These models still suffer from multimodal hallucination, i.e., generating objects or content that contradict the images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z) - FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs [12.533011020126855]
We introduce FIHA (autonomous Fine-graIned Hallucination evAluation in LVLMs).
FIHA can assess hallucination in LVLMs in an LLM-free and annotation-free way and model the dependency between different types of hallucination.
We introduce a benchmark called FIHA-v1, which consists of diverse questions on various images from MSCOCO and Foggy.
arXiv Detail & Related papers (2024-09-20T16:19:53Z) - Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning [24.270713960060142]
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension.
They still suffer from hallucination, i.e., generating outputs inconsistent with the image content.
We propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of LVLMs.
arXiv Detail & Related papers (2024-08-30T09:40:10Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Multi-Modal Hallucination Control by Visual Information Grounding [121.6983694815504]
We show that Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that are not always grounded in the input image.
We introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification.
M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt (a generic decoding sketch appears at the end of this page).
arXiv Detail & Related papers (2024-03-20T22:05:18Z) - Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - Detecting and Preventing Hallucinations in Large Vision Language Models [4.7264116948935975]
M-HalDetect is the first multi-modal hallucination detection dataset for detailed image descriptions.
We train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling (a generic sketch appears at the end of this page).
We find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively.
arXiv Detail & Related papers (2023-08-11T21:35:20Z) - Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
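The M3ID entry above describes amplifying the influence of the image over the language prior during decoding. A generic way to realize that idea is to contrast the next-token distribution conditioned on the image with the distribution from the same model without the image; the sketch below follows that contrastive recipe and is not the paper's exact formulation (the model's calling convention and the weight `alpha` are assumptions).

```python
import torch

@torch.no_grad()
def image_amplified_logits(model, input_ids, image, alpha=1.0):
    # Two forward passes: with and without the visual input (hypothetical signature).
    logits_img = model(input_ids=input_ids, image=image).logits[:, -1, :]
    logits_txt = model(input_ids=input_ids, image=None).logits[:, -1, :]
    logp_img = logits_img.log_softmax(dim=-1)
    logp_txt = logits_txt.log_softmax(dim=-1)
    # Up-weight tokens whose probability rises when the image is present,
    # i.e. tokens carrying more information about the visual prompt.
    return logp_img + alpha * (logp_img - logp_txt)
```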
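The M-HalDetect entry above mentions best-of-n rejection sampling against a fine-grained reward model. Generically, that means drawing n candidate responses and keeping the one the reward model scores highest; the `generate` and `reward` callables below are stand-ins, not the paper's interfaces.

```python
def best_of_n(generate, reward, image, prompt, n=8):
    # Sample n candidate answers and return the one the reward model prefers.
    candidates = [generate(image, prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(image, prompt, answer))
```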