Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance
- URL: http://arxiv.org/abs/2402.08680v1
- Date: Tue, 13 Feb 2024 18:59:05 GMT
- Title: Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance
- Authors: Linxi Zhao and Yihe Deng and Weitong Zhang and Quanquan Gu
- Abstract summary: Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
- Score: 56.04768229686853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of Large Vision-Language Models (LVLMs) has increasingly
highlighted the critical issue of their tendency to hallucinate non-existing
objects in the images. To address this issue, previous works focused on using
specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the
outputs of LVLMs. However, these approaches require either expensive
training/fine-tuning or API access to advanced LLMs to correct the model's
output post-generation. In this paper, we tackle this challenge by introducing
a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE
(MARINE), which is both training-free and API-free, and can effectively and
efficiently reduce object hallucinations during the generation process.
Specifically, MARINE enriches the visual context of LVLMs by integrating
existing open-source vision models, and employs classifier-free guidance to
incorporate the additional object grounding features to improve the precision
of LVLMs' generations. Through comprehensive evaluations across $6$ popular
LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of
MARINE, which even outperforms existing fine-tuning-based methods. Remarkably,
it not only reduces hallucinations but also improves the detailedness of LVLMs'
generations, as assessed by GPT-4V.
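The abstract does not spell out how the guidance is applied, so the following is only a minimal sketch of one common form of classifier-free guidance for next-token prediction. It blends log-probabilities from a pass that sees the extra object-grounding context with a pass that does not: gamma * log p_cond + (1 - gamma) * log p_uncond, followed by renormalization. The function names, the three-token toy vocabulary, the logit values, and the guidance strength `gamma` are all illustrative assumptions and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Blend conditional and unconditional next-token log-probabilities.

    Assumed form (generic classifier-free guidance, not necessarily the
    paper's exact rule): gamma * log p_cond + (1 - gamma) * log p_uncond,
    renormalized so the result is a valid log-distribution.
    """
    log_p_cond = log_softmax(cond_logits)
    log_p_uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u
             for c, u in zip(log_p_cond, log_p_uncond)]
    return log_softmax(mixed)


# Hypothetical toy vocabulary: index 0 = "dog", 1 = "cat", 2 = "frisbee".
# The pass without grounding context favors "frisbee" (a stand-in for a
# hallucinated object); the pass with grounding context favors "dog".
cond_logits = [2.0, 0.5, 0.0]
uncond_logits = [1.0, 0.5, 2.0]

guided = guided_log_probs(cond_logits, uncond_logits, gamma=0.7)
best_token = max(range(len(guided)), key=lambda i: guided[i])
```

In this toy setup, `gamma = 0` reproduces the ungrounded distribution and `gamma = 1` reproduces the grounded one; with `gamma = 0.7` the top-ranked token moves from index 2 to index 0.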
Related papers
- FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback [16.24562885483636]
We propose an innovative method to align modalities in Large Vision-Language Models (LVLMs) through Fine-Grained Artificial Intelligence Feedback (FGAIF).
Specifically, we first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm.
arXiv Detail & Related papers (2024-04-07T19:00:45Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content: the output is influenced primarily by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [68.46457611340097]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: their responses differ when textual and visual inputs correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination [58.36408867180233]
Large Language Models (LLMs) struggle with crucial issues of privacy violation and unwanted exposure of sensitive data.
We introduce a novel approach termed deliberate imagination in the context of LLM unlearning.
Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning.
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [27.930351465266515]
We propose a simple yet effective training strategy MoE-Tuning for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers.
Experiments show that MoE-LLaVA performs strongly on a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning [8.217445461627797]
Vision-Language Models (VLMs) may generate incorrect perception information in downstream applications, for example, captioning a non-existent entity.
To address the hallucination phenomenon, we introduce a Contrastive Instruction Evaluation Method (CIEM) and Contrastive Instruction Tuning (CIT).
We pinpoint the hallucination issues commonly present in existing VLMs, the inability of the current instruction-tuning dataset to handle the hallucination phenomenon, and the superiority of CIT-tuned VLMs over both CIEM and public datasets.
arXiv Detail & Related papers (2023-09-05T15:06:37Z)