ALOHa: A New Measure for Hallucination in Captioning Models
- URL: http://arxiv.org/abs/2404.02904v1
- Date: Wed, 3 Apr 2024 17:59:36 GMT
- Title: ALOHa: A New Measure for Hallucination in Captioning Models
- Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell
- Abstract summary: The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms.
We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations.
We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
- Score: 61.007542765171586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.
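The abstract describes a three-step pipeline: extract groundable objects from the candidate caption with an LLM, measure their semantic similarity to reference objects from captions and detections, and resolve the comparison with Hungarian matching to produce a hallucination score. The sketch below illustrates only the matching step under stated assumptions: the object lists are given directly (in practice they would come from an LLM prompt and object detections), and the embedding model and per-object scoring are choices of this illustration, not the paper's; see the code linked above for the actual ALOHa implementation.

```python
# Minimal sketch of the Hungarian-matching step, NOT the official ALOHa code.
# Assumptions: object lists are precomputed; the sentence-transformers model
# and the per-object similarity scoring are illustrative choices only.
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def object_similarity(candidate_objs, reference_objs):
    """Cosine similarity matrix between candidate and reference objects."""
    cand = model.encode(candidate_objs, normalize_embeddings=True)
    ref = model.encode(reference_objs, normalize_embeddings=True)
    return cand @ ref.T  # shape: (n_candidate, n_reference)

def matched_similarities(candidate_objs, reference_objs):
    """Hungarian-match each candidate object to a reference object; a low
    matched similarity suggests the candidate object may be hallucinated."""
    sim = object_similarity(candidate_objs, reference_objs)
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return {candidate_objs[r]: float(sim[r, c]) for r, c in zip(rows, cols)}

# Example: "dog" has no close reference object, so its matched score is low.
print(matched_similarities(
    ["cat", "sofa", "dog"],                     # objects from the candidate caption
    ["kitten", "couch", "window", "curtains"],  # objects from references + detections
))
```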
Related papers
- Evaluating Hallucination in Large Vision-Language Models based on Context-Aware Object Similarities [5.602853217226167]
We present Context-Aware Object Similarities (CAOS), a novel approach for evaluating object hallucination in large vision-language models (LVLMs).
CAOS integrates object statistics with semantic relationships between objects in captions and ground-truth data.
We further employ language model-based object recognition to detect potentially out-of-domain hallucinated objects.
arXiv Detail & Related papers (2025-01-25T03:03:18Z)
- Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models [51.50892380172863]
We show that most state-of-the-art MLLMs suffer from severe verb hallucination.
We propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination.
arXiv Detail & Related papers (2024-12-06T10:53:47Z)
- Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that the addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z)
- HallE-Control: Controlling Object Hallucination in Large Multimodal Models [80.03697683629035]
We introduce CCEval, a GPT-4 assisted evaluation method for detailed captioning.
While LMMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations.
Our method reduces hallucination by 44% compared to LLaVA-7B and maintains object coverage.
arXiv Detail & Related papers (2023-10-03T04:01:27Z)
- Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.