VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal
Models
- URL: http://arxiv.org/abs/2312.04087v1
- Date: Thu, 7 Dec 2023 06:53:55 GMT
- Title: VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal
Models
- Authors: Zongjie Li, Chaozheng Wang, Chaowei Liu, Pingchuan Ma, Daoyuan Wu,
Shuai Wang, Cuiyun Gao
- Abstract summary: We conduct the first comprehensive analysis of Large Multimodal Models (LMMs) using a variety of visual referring prompting strategies.
We develop an automated assessment framework to evaluate the accuracy of LMMs without the need for human intervention or manual labeling.
We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%.
- Score: 19.32035955420203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With recent advancements in Large Multimodal Models (LMMs) across various
domains, a novel prompting method called visual referring prompting has
emerged, showing significant potential in enhancing human-computer interaction
within multimodal systems. This method offers a more natural and flexible
approach to human interaction with these systems compared to traditional text
descriptions or coordinates. However, the categorization of visual referring
prompting remains undefined, and its impact on the performance of LMMs has yet
to be formally examined. In this study, we conduct the first comprehensive
analysis of LMMs using a variety of visual referring prompting strategies. We
introduce a benchmark dataset called VRPTEST, comprising 3 different visual
tasks and 2,275 images, spanning diverse combinations of prompt strategies.
Using VRPTEST, we conduct a comprehensive evaluation of eight versions of
prominent open-source and proprietary foundation models, including two early
versions of GPT-4V. We develop an automated assessment framework based on
software metamorphic testing techniques to evaluate the accuracy of LMMs
without the need for human intervention or manual labeling. We find that the
current proprietary models generally outperform the open-source ones, showing
an average accuracy improvement of 22.70%; however, there is still potential
for improvement. Moreover, our quantitative analysis shows that the choice of
prompt strategy significantly affects the accuracy of LMMs, with variations
ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate
visual referring prompting strategy can improve LMMs' understanding of context
and location information, while an unsuitable one might lead to answer
rejection. We also provide insights on minimizing the negative impact of visual
referring prompting on LMMs.
Related papers
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study [32.57246173437492]
This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses.
We conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO.
Notably, the enhanced LLaVA-1.5 outperforms its original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score.
arXiv Detail & Related papers (2024-01-31T16:38:32Z) - Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and
Symptom Analysis [2.4554686192257424]
Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
arXiv Detail & Related papers (2024-01-28T09:25:12Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - On the Robustness of Large Multimodal Models Against Image Adversarial
Attacks [81.2935966933355]
We study the impact of visual adversarial attacks on Large Multimodal Models (LMMs)
We find that in general LMMs are not robust to visual adversarial inputs.
We propose a new approach to real-world image classification which we term query decomposition.
arXiv Detail & Related papers (2023-12-06T04:59:56Z) - Interactive Hyperparameter Optimization in Multi-Objective Problems via
Preference Learning [65.51668094117802]
We propose a human-centered interactive HPO approach tailored towards multi-objective machine learning (ML)
Instead of relying on the user guessing the most suitable indicator for their needs, our approach automatically learns an appropriate indicator.
arXiv Detail & Related papers (2023-09-07T09:22:05Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.