FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
- URL: http://arxiv.org/abs/2311.01477v1
- Date: Thu, 2 Nov 2023 01:21:45 GMT
- Title: FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models
- Authors: Liqiang Jing and Ruosen Li and Yunmo Chen and Mengzhao Jia and Xinya
Du
- Abstract summary: We introduce FAITHSCORE, a reference-free and fine-grained evaluation metric that measures the faithfulness of free-form answers generated by large vision-language models (LVLMs).
We measure hallucinations in state-of-the-art LVLMs with FAITHSCORE on two newly collected benchmark datasets (LLaVA-1k and MSCOCO-Cap).
Results reveal that current systems are prone to generating hallucinated content unfaithful to the image, which leaves room for future improvement.
- Score: 17.9443875180437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce FAITHSCORE (Faithfulness to Atomic Image Facts Score), a
reference-free and fine-grained evaluation metric that measures the
faithfulness of free-form answers generated by large vision-language
models (LVLMs). The FAITHSCORE evaluation first identifies sub-sentences
containing descriptive statements that need to be verified, then extracts a
comprehensive list of atomic facts from these sub-sentences, and finally
conducts consistency verification between fine-grained atomic facts and the
input image. Meta-evaluation demonstrates that our metric highly correlates
with human judgments of faithfulness. We collect two benchmark datasets (i.e.,
LLaVA-1k and MSCOCO-Cap) for evaluating LVLMs' instruction-following
hallucinations. We measure hallucinations in state-of-the-art LVLMs with
FAITHSCORE on these datasets. Results reveal that current systems are prone to
generating hallucinated content unfaithful to the image, which leaves room for
future improvement. Further, we find that although current LVLMs do well on
color and counting, they still struggle with long answers, relations, and
multiple objects.
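The abstract above describes a three-stage pipeline: recognize the descriptive sub-sentences in an answer, decompose them into atomic facts, and verify each fact against the image, with the final score reflecting how many facts survive verification. Below is a minimal structural sketch of that flow in Python; the recognize, decompose, and verify callables and the faithscore/FaithScoreResult names are our placeholders (the paper relies on an LLM for the first two stages and a visual entailment model for the last), not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces, injected as callables so the sketch stays agnostic
# to which LLM or visual entailment model performs each stage.
Recognizer = Callable[[str], List[str]]   # answer -> descriptive sub-sentences
Decomposer = Callable[[str], List[str]]   # sub-sentence -> atomic facts
Verifier = Callable[[str, str], bool]     # (image_path, fact) -> fact supported?


@dataclass
class FaithScoreResult:
    score: float            # fraction of atomic facts consistent with the image
    facts: List[str]
    supported: List[bool]


def faithscore(answer: str, image_path: str,
               recognize: Recognizer,
               decompose: Decomposer,
               verify: Verifier) -> FaithScoreResult:
    """Structural sketch of the three-stage FAITHSCORE evaluation."""
    # Stage 1: keep only the descriptive (objectively verifiable) sub-sentences.
    descriptive = recognize(answer)
    # Stage 2: break each descriptive sub-sentence into atomic facts.
    facts = [fact for sentence in descriptive for fact in decompose(sentence)]
    if not facts:
        # Nothing verifiable in the answer: treat it as trivially faithful.
        return FaithScoreResult(score=1.0, facts=[], supported=[])
    # Stage 3: verify each atomic fact against the input image.
    supported = [verify(image_path, fact) for fact in facts]
    return FaithScoreResult(score=sum(supported) / len(facts),
                            facts=facts, supported=supported)
```

Passing the three stages in as callables is a design choice of the sketch, not of the paper: it keeps the scoring logic separate from whichever models are used for recognition, decomposition, and verification.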
Related papers
- A Survey of Multimodal Hallucination Evaluation and Detection [52.03164192840023]
Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information.
These models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge.
This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks.
arXiv Detail & Related papers (2025-07-25T07:22:42Z) - HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination."
This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z) - Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks.
DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units.
DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z) - Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models [67.89204055004028]
Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination.
Previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics.
We propose a Hallucination benchmark Quality Measurement framework (HQM) to assess the reliability and validity of existing hallucination benchmarks.
arXiv Detail & Related papers (2024-06-24T20:08:07Z) - HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation [19.318217051269382]
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP).
HalluDial is the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation.
The benchmark includes 4,094 dialogues with a total of 146,856 samples.
arXiv Detail & Related papers (2024-06-11T08:56:18Z) - MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucination in LVLMs which have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization [29.49641083851667]
We propose a new evaluation benchmark for topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes.
We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences.
arXiv Detail & Related papers (2024-02-20T18:58:49Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
LVLMs are still plagued by the hallucination problem.
Hallucination refers to information in LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study of object hallucination in large vision-language models (LVLMs).
We find that LVLMs tend to generate descriptions containing objects that are inconsistent with the target images.
We propose a polling-based query method called POPE to evaluate object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)