VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal
Models
- URL: http://arxiv.org/abs/2312.04087v1
- Date: Thu, 7 Dec 2023 06:53:55 GMT
- Title: VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal
Models
- Authors: Zongjie Li, Chaozheng Wang, Chaowei Liu, Pingchuan Ma, Daoyuan Wu,
Shuai Wang, Cuiyun Gao
- Abstract summary: We conduct the first comprehensive analysis of Large Multimodal Models (LMMs) using a variety of visual referring prompting strategies.
We develop an automated assessment framework to evaluate the accuracy of LMMs without the need for human intervention or manual labeling.
We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%.
- Score: 19.32035955420203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With recent advancements in Large Multimodal Models (LMMs) across various
domains, a novel prompting method called visual referring prompting has
emerged, showing significant potential in enhancing human-computer interaction
within multimodal systems. This method offers a more natural and flexible
approach to human interaction with these systems compared to traditional text
descriptions or coordinates. However, the categorization of visual referring
prompting remains undefined, and its impact on the performance of LMMs has yet
to be formally examined. In this study, we conduct the first comprehensive
analysis of LMMs using a variety of visual referring prompting strategies. We
introduce a benchmark dataset called VRPTEST, comprising 3 different visual
tasks and 2,275 images, spanning diverse combinations of prompt strategies.
Using VRPTEST, we conduct a comprehensive evaluation of eight versions of
prominent open-source and proprietary foundation models, including two early
versions of GPT-4V. We develop an automated assessment framework based on
software metamorphic testing techniques to evaluate the accuracy of LMMs
without the need for human intervention or manual labeling. We find that the
current proprietary models generally outperform the open-source ones, showing
an average accuracy improvement of 22.70%; however, there is still potential
for improvement. Moreover, our quantitative analysis shows that the choice of
prompt strategy significantly affects the accuracy of LMMs, with variations
ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate
visual referring prompting strategy can improve LMMs' understanding of context
and location information, while an unsuitable one might lead to answer
rejection. We also provide insights on minimizing the negative impact of visual
referring prompting on LMMs.
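To make the evaluation setup concrete, below is a minimal sketch (not the authors' implementation) of how a visual referring prompt can be constructed by overlaying a marker on an image, and how a metamorphic-testing-style consistency check could flag unstable answers without manual labeling. The helper query_lmm is a hypothetical placeholder for whichever LMM API is queried, and the marker-color variation is an assumed semantics-preserving transformation.

from PIL import Image, ImageDraw

def add_box_marker(image, box, color="red"):
    """Return a copy of `image` with a rectangle drawn around the referred region."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline=color, width=4)
    return annotated

def query_lmm(image, question):
    """Hypothetical placeholder for a call to the LMM under test."""
    raise NotImplementedError

def metamorphic_check(image, box, question):
    """A semantics-preserving change (here, the marker color) should not change
    the model's answer; disagreement across variants flags a suspicious output."""
    answers = {query_lmm(add_box_marker(image, box, c), question)
               for c in ("red", "blue", "green")}
    return len(answers) == 1

# Example usage (path, coordinates, and question are illustrative only):
# img = Image.open("example.jpg")
# consistent = metamorphic_check(img, box=(40, 60, 200, 180),
#                                question="What object is inside the marked region?")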
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs [55.74117540987519]
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs).
We introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs.
We evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries.
arXiv Detail & Related papers (2024-10-10T17:31:17Z)
- LLaVA-Critic: Learning to Evaluate Multimodal Models [110.06665155812162]
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator.
LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios.
arXiv Detail & Related papers (2024-10-03T17:36:33Z)
- X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation [47.96737683498274]
Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to be effective at enriching item descriptions.
This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, to address limitations by prompting LMMs to explicitly identify and reconcile supportive and conflicting information between text and images.
arXiv Detail & Related papers (2024-08-27T16:10:21Z)
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study [32.57246173437492]
This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses.
We conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO.
Notably, the enhanced LLaVA-1.5 outperforms its original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score.
arXiv Detail & Related papers (2024-01-31T16:38:32Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [81.2935966933355]
We study the impact of visual adversarial attacks on Large Multimodal Models (LMMs).
We find that, in general, LMMs are not robust to visual adversarial inputs.
We propose a new approach to real-world image classification which we term query decomposition.
arXiv Detail & Related papers (2023-12-06T04:59:56Z)