Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
- URL: http://arxiv.org/abs/2310.16033v3
- Date: Mon, 12 Feb 2024 05:00:09 GMT
- Title: Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
- Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
- Abstract summary: We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs.
- Score: 12.598351373932234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently achieved promising
zero-shot accuracy on visual question answering (VQA) -- a fundamental task
affecting various downstream applications and domains. Given the great
potential for the broad use of these models, it is important to investigate
their limitations in dealing with different image and question properties. In
this work, we investigate whether MLLMs can perceive small details as well as
large details in images. In particular, we show that their zero-shot accuracy
in answering visual questions is very sensitive to the size of the visual
subject of the question, declining by up to 46% as the subject becomes smaller. Furthermore, we show
that this effect is causal by observing that human visual cropping can
significantly mitigate their sensitivity to size. Inspired by the usefulness of
human cropping, we then propose five automatic visual cropping methods --
leveraging either external localization models or the decision process of the
given MLLM itself -- as inference time mechanisms to improve the zero-shot
performance of MLLMs. We study their effectiveness on four popular VQA
datasets, and a subset of the VQAv2 dataset tailored towards fine visual
details. Our findings suggest that MLLMs should be used with caution in
detail-sensitive VQA applications, and that visual cropping is a promising
direction to improve their zero-shot performance. To facilitate further
investigation of MLLMs' behaviors, our code and data are publicly released.
Related papers
- Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities.
We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do.
We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z)
- Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
arXiv Detail & Related papers (2024-09-15T16:45:42Z)
- Understanding Information Storage and Transfer in Multi-modal Large Language Models [51.20840103605018]
We study how Multi-modal Large Language Models process information in a factual visual question answering task.
Key findings show that these MLLMs rely on self-attention blocks in much earlier layers for information storage.
We introduce MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs.
arXiv Detail & Related papers (2024-06-06T16:35:36Z)
- Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z)
- Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy than humans on pairwise comparisons.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
- Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can yield a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)