Related papers: ALOHa: A New Measure for Hallucination in Captioning Models

URL: http://arxiv.org/abs/2404.02904v1
Date: Wed, 3 Apr 2024 17:59:36 GMT
Title: ALOHa: A New Measure for Hallucination in Captioning Models
Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell
Abstract summary: The existing metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
Score: 61.007542765171586
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.
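
The matching step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: the object lists are given directly instead of being extracted by an LLM, a character-trigram Jaccard overlap stands in for the semantic similarity model, a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same optimal one-to-one matching, but only scales to a handful of objects), and taking the minimum matched similarity as the caption-level score is an assumed aggregation. The names `similarity`, `match_objects`, and `caption_score` are hypothetical.

```python
from itertools import permutations


def trigrams(word):
    """Character trigrams of a lowercased, space-padded word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Toy stand-in for semantic similarity: trigram Jaccard overlap in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def match_objects(candidates, references):
    """Optimal one-to-one matching of candidate objects to reference objects.

    Brute force over assignments (assumption: small inputs); a real
    implementation would use the Hungarian algorithm instead.
    Returns a list of (candidate, matched_reference_or_None, score).
    """
    if not candidates:
        return []
    # Pad with None so every candidate can be assigned even when there
    # are fewer references than candidates; unmatched objects score 0.
    refs = list(references) + [None] * max(0, len(candidates) - len(references))
    best_total, best = -1.0, []
    for perm in permutations(refs, len(candidates)):
        scores = [similarity(c, r) if r is not None else 0.0
                  for c, r in zip(candidates, perm)]
        total = sum(scores)
        if total > best_total:
            best_total = total
            best = list(zip(candidates, perm, scores))
    return best


def caption_score(matches):
    """Assumed aggregation: the least-supported object decides the score."""
    return min((score for _, _, score in matches), default=1.0)


if __name__ == "__main__":
    candidate_objects = ["dog", "frisbee", "bench"]   # from the caption
    reference_objects = ["dog", "frisbee", "grass"]   # from references/detections
    matches = match_objects(candidate_objects, reference_objects)
    for cand, ref, score in matches:
        print(f"{cand!r} -> {ref!r}: {score:.2f}")
    print("caption score:", caption_score(matches))
```

In this toy example "bench" has no well-supported counterpart among the references, so it receives a low matched similarity and pulls the caption-level score down, which is the behavior an object hallucination metric is meant to capture.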

Related papers

Evaluating Hallucination in Large Vision-Language Models based on Context-Aware Object Similarities [5.602853217226167]
We present Context-Aware Object Similarities (CAOS), a novel approach for evaluating object hallucination in large vision-language models (LVLMs). CAOS integrates object statistics with semantic relationships between objects in captions and ground-truth data. We further employ language model-based object recognition to detect potentially out-of-domain hallucinated objects.
arXiv Detail & Related papers (2025-01-25T03:03:18Z)
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models [51.50892380172863]
We show that most state-of-the-art MLLMs suffer from severe verb hallucination. We propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination.
arXiv Detail & Related papers (2024-12-06T10:53:47Z)
Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models [22.42712853647949]
We present an in-depth investigation into the object hallucination problem specifically within the CLIP model. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. We show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
arXiv Detail & Related papers (2024-10-04T06:24:49Z)
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z)
Data-augmented phrase-level alignment for mitigating object hallucination [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination. We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z)
HallE-Control: Controlling Object Hallucination in Large Multimodal Models [80.03697683629035]
We introduce $\textit{CCEval}$, a GPT-4 assisted evaluation method for detailed captioning. While LMMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations. Our method reduces hallucination by 44% compared to LLaVA$_{7B}$ and maintains the object coverage.
arXiv Detail & Related papers (2023-10-03T04:01:27Z)
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z)
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text. We show that models achieving better scores on standard metrics could hallucinate objects more frequently. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.