Exploring Perceptual Limitation of Multimodal Large Language Models
- URL: http://arxiv.org/abs/2402.07384v1
- Date: Mon, 12 Feb 2024 03:04:42 GMT
- Title: Exploring Perceptual Limitation of Multimodal Large Language Models
- Authors: Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun
- Abstract summary: We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
- Score: 57.567868157293994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently shown remarkable
perceptual capability in answering visual questions; however, little is known
about the limits of their perception. In particular, while prior works have
provided anecdotal evidence of MLLMs' sensitivity to object size, this
phenomenon and its underlying causes have not been explored comprehensively. In
this work, we quantitatively study the perception of small visual objects in
several state-of-the-art MLLMs and reveal a pervasive limitation in answering
questions about small objects in images. Next, we identify four independent
factors that can contribute to this limitation -- object quality, size,
distractors, and location -- and conduct controlled intervention studies to
measure the effect of each factor on MLLMs' perception. In particular, we find
that lower object quality and smaller object size can both independently reduce
MLLMs' ability to answer visual questions. More surprisingly, we find that the
location of the object in the image and the presence of visual distractors can
also significantly reduce MLLMs' question answering accuracy. Our study
provides a better understanding of the perceptual limitation of MLLMs and
contributes new evaluation protocols for analyzing the perception of future
MLLMs. To facilitate further investigations, we release our code and data.
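To make the evaluation protocol concrete, below is a minimal sketch (not the authors' released code) of a size-intervention experiment in the spirit of the abstract: the same object crop is pasted onto a fixed canvas at different scales, and a model's visual question answering accuracy is compared across scales. The `place_object` and `intervention_accuracy` helpers, the `mllm_answer` callable, the canvas size, and the substring-match scoring are all illustrative assumptions.

```python
# Illustrative sketch only -- `mllm_answer` is a stand-in for whatever MLLM
# API is being evaluated; it is not part of the paper's released code.
from PIL import Image


def place_object(obj_crop: Image.Image, scale: float, position: tuple[int, int],
                 canvas_size: tuple[int, int] = (448, 448)) -> Image.Image:
    """Resize an object crop and paste it onto a blank canvas at `position`."""
    w, h = obj_crop.size
    resized = obj_crop.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = Image.new("RGB", canvas_size, color=(128, 128, 128))
    canvas.paste(resized, position)
    return canvas


def intervention_accuracy(mllm_answer, samples, scale: float) -> float:
    """Accuracy of `mllm_answer(image, question) -> str` over `samples`,
    where `samples` yields (object_crop, question, gold_answer) triples."""
    correct, total = 0, 0
    for obj_crop, question, gold in samples:
        image = place_object(obj_crop, scale, position=(32, 32))
        prediction = mllm_answer(image, question)
        correct += int(gold.lower() in prediction.lower())  # loose string match
        total += 1
    return correct / max(total, 1)


# Example usage: compare the same questions at full size vs. quarter size.
# acc_large = intervention_accuracy(my_model, vqa_samples, scale=1.0)
# acc_small = intervention_accuracy(my_model, vqa_samples, scale=0.25)
```

Analogous interventions that vary the paste position or add distractor objects would probe the location and distractor factors described in the abstract.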
Related papers
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting designed to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z) - Understanding Information Storage and Transfer in Multi-modal Large Language Models [51.20840103605018]
We study how Multi-modal Large Language Models process information in a factual visual question answering task.
Key findings show that these MLLMs rely on self-attention blocks in much earlier layers for information storage.
We introduce MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs.
arXiv Detail & Related papers (2024-06-06T16:35:36Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Hallucination of Multimodal Large Language Models: A Survey [40.73148186369018]
Multimodal large language models (MLLMs) have demonstrated significant advancements and remarkable abilities in multimodal tasks.
Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content.
This survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
arXiv Detail & Related papers (2024-04-29T17:59:41Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
These models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z) - Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs.
arXiv Detail & Related papers (2023-10-24T17:48:04Z) - Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate the potential abilities of MLLMs in three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.