Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance
- URL: http://arxiv.org/abs/2402.08680v1
- Date: Tue, 13 Feb 2024 18:59:05 GMT
- Title: Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance
- Authors: Linxi Zhao and Yihe Deng and Weitong Zhang and Quanquan Gu
- Abstract summary: Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE)
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
- Score: 56.04768229686853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of Large Vision-Language Models (LVLMs) has increasingly
highlighted the critical issue of their tendency to hallucinate non-existing
objects in the images. To address this issue, previous works focused on using
specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the
outputs of LVLMs. However, these approaches require either expensive
training/fine-tuning or API access to advanced LLMs to correct the model's
output post-generation. In this paper, we tackle this challenge by introducing
a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE
(MARINE), which is both training-free and API-free, and can effectively and
efficiently reduce object hallucinations during the generation process.
Specifically, MARINE enriches the visual context of LVLMs by integrating
existing open-source vision models, and employs classifier-free guidance to
incorporate the additional object grounding features to improve the precision
of LVLMs' generations. Through comprehensive evaluations across $6$ popular
LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of
MARINE, which even outperforms existing fine-tuning-based methods. Remarkably,
it not only reduces hallucinations but also improves the detailedness of LVLMs'
generations, as assessed by GPT-4V.
Related papers
- Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning [16.883679810267342]
Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination.
This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination.
arXiv Detail & Related papers (2024-10-16T00:15:40Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [49.407311947143825]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
We also propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP)
arXiv Detail & Related papers (2024-10-10T17:59:22Z) - Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities.
We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do.
We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z) - CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs [37.98496239547762]
Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment.
We present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs.
arXiv Detail & Related papers (2024-08-19T21:56:20Z) - Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD) [13.430637580980164]
Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities.
Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on Large Language Models distribution confidence levels.
Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models.
arXiv Detail & Related papers (2024-08-06T08:10:34Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - CIEM: Contrastive Instruction Evaluation Method for Better Instruction
Tuning [8.217445461627797]
Vision-Language Models (VLMs) may generate incorrect perception information when doing downstream applications, for example, captioning a non-existent entity.
To address the hallucination phenomenon, we introduce a Contrastive Instruction Evaluation Method (CIEM) and Contrastive Instruction Tuning (CIT)
We pinpoint the hallucination issues commonly present in existing VLMs, the disability of the current instruction-tuning dataset to handle the hallucination phenomenon and the superiority of CIT-tuned VLMs over both CIEM and public datasets.
arXiv Detail & Related papers (2023-09-05T15:06:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.