Related papers: Explore the Hallucination on Low-level Perception for MLLMs

Explore the Hallucination on Low-level Perception for MLLMs

URL: http://arxiv.org/abs/2409.09748v1
Date: Sun, 15 Sep 2024 14:38:29 GMT
Title: Explore the Hallucination on Low-level Perception for MLLMs
Authors: Yinan Sun, Zicheng Zhang, Haoning Wu, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min,
Abstract summary: We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. We present QL-Bench, a benchmark settings to simulate human responses to low-level vision. We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
Score: 83.12180878559295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has significantly influenced various aspects of industry and daily life, showcasing impressive capabilities in visual perception and understanding. However, these models also exhibit hallucinations, which limit their reliability as AI systems, especially in tasks involving low-level visual perception and understanding. We believe that hallucinations stem from a lack of explicit self-awareness in these models, which directly impacts their overall performance. In this paper, we aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. To this end, we present QL-Bench, a benchmark settings to simulate human responses to low-level vision, investigating self-awareness in low-level visual perception through visual question answering related to low-level attributes such as clarity and lighting. Specifically, we construct the LLSAVisionQA dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features. Through the evaluation of 15 MLLMs, we demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped. Notably, for the same model, simpler questions are often answered more accurately than complex ones. However, self-awareness appears to improve when addressing more challenging questions. We hope that our benchmark will motivate further research, particularly focused on enhancing the self-awareness of MLLMs in tasks involving low-level visual perception and understanding.

Related papers

Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks. We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks. We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z)
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs [11.532430076027554]
We study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself.
arXiv Detail & Related papers (2025-02-24T18:54:40Z)
Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs) We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
Multi-Object Hallucination in Vision-Language Models [28.135215173793785]
Large vision language models (LVLMs) often suffer from object hallucination. Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and intrinsic model behaviors.
arXiv Detail & Related papers (2024-07-08T17:59:57Z)
Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs. We identify four independent factors that can contribute to this limitation. Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z)
Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision. We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. These models also suffer from hallucinations, which limit their reliability as AI systems. This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z)
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings. We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences. We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
Mitigating Hallucination in Visual Language Models with Visual Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer from hallucination a lot. Key problem lies in its weak ability to comprehend detailed content in a multi-modal context. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z)
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models. We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.