FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
- URL: http://arxiv.org/abs/2311.01477v1
- Date: Thu, 2 Nov 2023 01:21:45 GMT
- Title: FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
- Authors: Liqiang Jing and Ruosen Li and Yunmo Chen and Mengzhao Jia and Xinya
Du
- Abstract summary: We introduce FAITHSCORE, a reference-free and fine-grained evaluation metric that measures the faithfulness of the generated free-form answers from large vision-language models (LVLMs)
We measure hallucinations in state-of-the-art LVLMs with FAITHSCORE on the datasets.
Results reveal that current systems are prone to generate hallucinated content unfaithful to the image, which leaves room for future improvements.
- Score: 17.9443875180437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce FAITHSCORE (Faithfulness to Atomic Image Facts Score), a
reference-free and fine-grained evaluation metric that measures the
faithfulness of the generated free-form answers from large vision-language
models (LVLMs). The FAITHSCORE evaluation first identifies sub-sentences
containing descriptive statements that need to be verified, then extracts a
comprehensive list of atomic facts from these sub-sentences, and finally
conducts consistency verification between fine-grained atomic facts and the
input image. Meta-evaluation demonstrates that our metric highly correlates
with human judgments of faithfulness. We collect two benchmark datasets (i.e.
LLaVA-1k and MSCOCO-Cap) for evaluating LVLMs instruction-following
hallucinations. We measure hallucinations in state-of-the-art LVLMs with
FAITHSCORE on the datasets. Results reveal that current systems are prone to
generate hallucinated content unfaithful to the image, which leaves room for
future improvements. Further, we find that current LVLMs despite doing well on
color and counting, still struggle with long answers, relations, and multiple
objects.
Related papers
- Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models [67.89204055004028]
We propose a Hallucination benchmark Quality Measurement framework (HQM) to assess the reliability and validity of existing hallucination benchmarks.
We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-Vision-Pro, to provide an in-depth analysis of the hallucination issues in existing models.
arXiv Detail & Related papers (2024-06-24T20:08:07Z) - Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning? [29.237078890377514]
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content.
Using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image.
This paper proposes a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics.
arXiv Detail & Related papers (2024-06-18T14:33:56Z) - MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overseen in previous works.
We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z) - Mitigating Object Hallucination via Data Augmented Contrastive Tuning [52.43197107069751]
Multimodal Large Language Models (MLLMs) tend to hallucinate factually inaccurate information.
We introduce a contrastive tuning method that can be applied to a pretrained off-the-shelf MLLM for mitigating hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization [29.49641083851667]
We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes.
We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences.
arXiv Detail & Related papers (2024-02-20T18:58:49Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - Detecting and Preventing Hallucinations in Large Vision Language Models [4.7264116948935975]
M-HalDetect is the first multi-modal hallucination detection dataset for detailed image descriptions.
We train fine-grained multi-modal reward models from InstructBLIP and evaluate their effectiveness with best-of-n rejection sampling.
We find that our reward model generalizes to other multi-modal models, reducing hallucinations in LLaVA and mPLUG-OWL by 15% and 57% respectively.
arXiv Detail & Related papers (2023-08-11T21:35:20Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs)
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.