Mitigating Fine-Grained Hallucination by Fine-Tuning Large
Vision-Language Models with Caption Rewrites
- URL: http://arxiv.org/abs/2312.01701v1
- Date: Mon, 4 Dec 2023 07:43:02 GMT
- Title: Mitigating Fine-Grained Hallucination by Fine-Tuning Large
Vision-Language Models with Caption Rewrites
- Authors: Lei Wang, Jiabang He, Shenshen Li, Ning Liu, Ee-Peng Lim
- Abstract summary: We propose ReCaption, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions.
Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality.
- Score: 18.640459366439917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable performance in natural
language processing (NLP) tasks. To comprehend and execute diverse human
instructions over image data, instruction-tuned large vision-language models
(LVLMs) have been introduced. However, LVLMs may suffer from different types of
object hallucination, and current evaluation methods measure only coarse-grained
object hallucinations (i.e., generated objects that do not exist in the input
image). Fine-grained object attributes and behaviors that are absent from the
image may still be generated yet go unmeasured by these methods. In this paper,
we therefore focus on reducing fine-grained hallucinations
of LVLMs. We propose \textit{ReCaption}, a framework that consists of two
components: rewriting captions using ChatGPT and fine-tuning the
instruction-tuned LVLMs on the rewritten captions. We also propose a
fine-grained probing-based evaluation method named \textit{Fine-Grained Object
Hallucination Evaluation} (\textit{FGHE}). Our experiment results demonstrate
that ReCaption effectively reduces fine-grained object hallucination for
different LVLM options and improves their text generation quality. The code can
be found at https://github.com/Anonymousanoy/FOHE.
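The abstract describes a two-stage recipe: rewrite each training caption with ChatGPT so that fine-grained object attributes and behaviors are stated explicitly, then fine-tune the instruction-tuned LVLM on the rewritten captions. The sketch below illustrates the rewriting stage and the assembly of fine-tuning records under those assumptions; the prompt wording, model choice, and helper names are illustrative, not the authors' implementation (see the linked repository for that).

```python
# Sketch of the caption-rewriting stage described in the abstract.
# Assumptions for illustration: the OpenAI Python client (>=1.0), the prompt
# text, the model name, and the sample dict keys are all placeholders, not
# the authors' exact setup (see https://github.com/Anonymousanoy/FOHE).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

REWRITE_PROMPT = (
    "Rewrite the following image caption so that the objects it mentions, "
    "their attributes (color, count, size), and their actions and relations "
    "are stated explicitly and accurately. Do not add objects that are not "
    "mentioned.\n\nCaption: {caption}\n\nRewritten caption:"
)

def rewrite_caption(caption: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask ChatGPT for an attribute- and behavior-rich rewrite of one caption."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(caption=caption)}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

def build_finetuning_records(samples):
    """Pair each image with its rewritten caption for instruction tuning.

    `samples` is assumed to be an iterable of dicts with 'image_path' and
    'caption' keys; the output records can be fed to whatever instruction-
    tuning pipeline the chosen LVLM uses.
    """
    for s in samples:
        yield {
            "image": s["image_path"],
            "instruction": "Describe this image in detail.",
            "response": rewrite_caption(s["caption"]),
        }
```

The fine-tuning stage would then feed these records through the chosen LVLM's usual instruction-tuning pipeline (for example, a LoRA fine-tune); that step is model-specific and omitted here.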
Related papers
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo).
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z)
- HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
These models still suffer from multimodal hallucination, i.e., the generation of objects or content that conflicts with the input images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z)
- Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that the addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z)
- MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier that detects hallucinations at the token level at negligible cost (a rough sketch of this idea follows the entry).
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs that have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
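The MetaToken summary above amounts to a lightweight binary classifier over per-token signals. As a rough illustration of that idea, the sketch below trains a logistic-regression detector on toy data; the features (token log-probability, predictive entropy, relative position) and labels are assumptions for demonstration, not the paper's feature set or training protocol.

```python
# Toy illustration of a MetaToken-style detector: a lightweight binary
# classifier over cheap per-token signals. Features and labels below are
# synthetic stand-ins, not the paper's actual feature set or data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# One row per generated token; assumed feature columns:
# [token log-probability, predictive entropy, relative position in caption]
X_train = rng.normal(size=(1000, 3))
y_train = rng.integers(0, 2, size=1000)  # 1 = hallucinated token (toy labels)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def flag_hallucinated_tokens(token_features, threshold=0.5):
    """Return a boolean mask over tokens whose hallucination score exceeds the threshold."""
    scores = clf.predict_proba(token_features)[:, 1]
    return scores >= threshold

print(flag_hallucinated_tokens(rng.normal(size=(20, 3))))
```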
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate object hallucination (a minimal polling loop in this spirit is sketched after this entry).
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
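POPE, as summarized above, polls the model with yes/no questions about whether specific objects appear in an image, so that a spurious "yes" for an absent object reveals hallucination. The sketch below captures that polling loop; the question template and the ask_model callback are placeholders, and the benchmark's random/popular/adversarial negative-sampling strategies are not reproduced here.

```python
# Sketch of a POPE-style polling evaluation: ask yes/no questions about
# object presence and measure how often absent objects receive a "yes".
# `ask_model(image, question) -> str` is a placeholder for the LVLM under test.
QUESTION = "Is there a {obj} in the image? Please answer yes or no."

def poll_object_hallucination(samples, ask_model):
    """`samples`: iterable of (image, present_objects, absent_objects) triples."""
    tp = fp = tn = fn = 0
    for image, present, absent in samples:
        queries = [(o, True) for o in present] + [(o, False) for o in absent]
        for obj, truth in queries:
            answer = ask_model(image, QUESTION.format(obj=obj)).strip().lower()
            said_yes = answer.startswith("yes")
            if said_yes and truth:
                tp += 1
            elif said_yes and not truth:
                fp += 1  # "yes" for an absent object signals hallucination
            elif not said_yes and not truth:
                tn += 1
            else:
                fn += 1
    total = max(tp + tn + fp + fn, 1)
    return {
        "accuracy": (tp + tn) / total,
        "yes_ratio": (tp + fp) / total,
        "precision": tp / max(tp + fp, 1),
        "recall": tp / max(tp + fn, 1),
    }
```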
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.