Exploring Perceptual Limitation of Multimodal Large Language Models
- URL: http://arxiv.org/abs/2402.07384v1
- Date: Mon, 12 Feb 2024 03:04:42 GMT
- Title: Exploring Perceptual Limitation of Multimodal Large Language Models
- Authors: Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong
Sun
- Abstract summary: We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
- Score: 57.567868157293994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently shown remarkable
perceptual capability in answering visual questions; however, little is known
about the limits of their perception. In particular, while prior works have
provided anecdotal evidence of MLLMs' sensitivity to object size, this
phenomenon and its underlying causes have not been explored comprehensively. In
this work, we quantitatively study the perception of small visual objects in
several state-of-the-art MLLMs and reveal a pervasive limitation in answering
questions about small objects in images. Next, we identify four independent
factors that can contribute to this limitation -- object quality, size,
distractors, and location -- and conduct controlled intervention studies to
measure the effect of each factor on MLLMs' perception. In particular, we find
that lower object quality and smaller object size can both independently reduce
MLLMs' ability to answer visual questions. More surprisingly, we find that the
location of the object in the image and the presence of visual distractors can
also significantly reduce MLLMs' question answering accuracy. Our study
provides a better understanding of the perceptual limitation of MLLMs and
contributes new evaluation protocols for analyzing the perception of future
MLLMs. To facilitate further investigations, we release our code and data.
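As a rough illustration of what such a controlled intervention might look like in practice, the sketch below varies one factor (object size) while holding position and distractors fixed. The composition helper and the `query_mllm` wrapper are hypothetical stand-ins, not the paper's released code.

```python
from PIL import Image

def compose_test_image(canvas_size, object_img, scale, location, distractor_img=None):
    """Paste a target object onto a neutral canvas at a controlled size and
    position, optionally adding a visual distractor in a fixed corner."""
    canvas = Image.new("RGB", canvas_size, (128, 128, 128))  # neutral gray background
    w, h = object_img.size
    obj = object_img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas.paste(obj, location)
    if distractor_img is not None:
        canvas.paste(distractor_img, (0, 0))
    return canvas

def size_intervention(object_img, question, answer, query_mllm):
    """Sweep object size while holding the other factors fixed, and record
    whether the model's answer contains the ground-truth string."""
    results = {}
    for scale in (1.0, 0.5, 0.25, 0.125):
        img = compose_test_image((1024, 1024), object_img, scale, (400, 400))
        prediction = query_mllm(img, question)  # hypothetical MLLM API wrapper
        results[scale] = answer.lower() in prediction.lower()
    return results
```

Analogous sweeps over location (varying `location` on a grid) and distractors (toggling `distractor_img`) would isolate the other two factors in the same way.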
Related papers
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- Understanding Information Storage and Transfer in Multi-modal Large Language Models [51.20840103605018]
We study how Multi-modal Large Language Models process information in a factual visual question answering task.
Key findings show that these MLLMs rely on self-attention blocks in much earlier layers for information storage.
We introduce MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs.
arXiv Detail & Related papers (2024-06-06T16:35:36Z)
- Hallucination of Multimodal Large Language Models: A Survey [40.73148186369018]
Multimodal large language models (MLLMs) have demonstrated significant advancements and remarkable abilities in multimodal tasks.
Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content.
This survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
arXiv Detail & Related papers (2024-04-29T17:59:41Z)
- Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be explained to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information in given image sequences, often hallucinating or misrepresenting objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
- MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
These models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z)
- Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs (see the sketch after this list).
arXiv Detail & Related papers (2023-10-24T17:48:04Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
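As a hedged sketch of the visual-cropping idea referenced in the "Towards Perceiving Small Visual Details" entry above: the heuristic below is a simple center-crop stand-in, not one of that paper's five proposed methods, and `query_mllm` is again a hypothetical wrapper around an MLLM call.

```python
from PIL import Image

def center_crop_and_requery(image: Image.Image, question, query_mllm, crop_frac=0.5):
    """Zoom into the image center so a small visual subject occupies more of
    the model's input, then re-ask the question on the enlarged crop."""
    w, h = image.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    left, top = (w - cw) // 2, (h - ch) // 2
    crop = image.crop((left, top, left + cw, top + ch))
    return query_mllm(crop.resize((w, h)), question)  # upsample back to full size
```

A practical variant would locate the question's visual subject first (e.g., via an attention map or detector) and crop around that region instead of the center.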