Related papers: VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models

URL: http://arxiv.org/abs/2312.04087v1
Date: Thu, 7 Dec 2023 06:53:55 GMT
Title: VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models
Authors: Zongjie Li, Chaozheng Wang, Chaowei Liu, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao
Abstract summary: We conduct the first comprehensive analysis of Large Multimodal Models (LMMs) using a variety of visual referring prompting strategies. We develop an automated assessment framework to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%.
Score: 19.32035955420203
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%; however, there is still potential for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one might lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.
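The label-free evaluation idea in the abstract can be illustrated with a generic metamorphic-testing sketch. This is not the authors' framework: the metamorphic relation used here (adding a visual referring prompt should leave the answer unchanged), the dict-based image stand-in, and the names add_visual_prompt, consistency_rate, and toy_model are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical types: an "image" is a plain dict standing in for real pixel data.
Image = dict
Model = Callable[[Image, str], str]


def add_visual_prompt(image: Image, strategy: str) -> Image:
    """Return a copy of the image tagged with a (hypothetical) visual referring prompt."""
    annotated = dict(image)
    annotated["visual_prompt"] = strategy
    return annotated


def consistency_rate(model: Model, cases: Iterable[Tuple[Image, str]], strategy: str) -> float:
    """Fraction of cases whose answer is unchanged after adding the visual prompt.

    No ground-truth labels are needed: the answer on the original image
    serves as the reference for the answer on the transformed image.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("no test cases")
    same = sum(
        model(img, q).strip().lower()
        == model(add_visual_prompt(img, strategy), q).strip().lower()
        for img, q in cases
    )
    return same / len(cases)


def toy_model(image: Image, question: str) -> str:
    # Toy stand-in for an LMM (assumption): it rejects the query when it sees an arrow prompt.
    if image.get("visual_prompt") == "arrow":
        return "I cannot answer."
    return image["label"]


cases = [({"label": "cat"}, "What is this?"), ({"label": "dog"}, "What is this?")]
print(consistency_rate(toy_model, cases, "circle"))  # 1.0
print(consistency_rate(toy_model, cases, "arrow"))   # 0.0
```

In this toy setup, the "arrow" strategy triggers answer rejection and drives the consistency rate to zero, loosely mirroring the abstract's remark that an unsuitable prompting strategy might lead to answer rejection.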

Related papers

Just Noticeable Difference for Large Multimodal Models [70.41467229325345]
Just noticeable difference (JND) is the minimum change that the human visual system (HVS) can perceive. We take an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. Our research underscores the significance of LMM-JND as a unique perspective for studying LMMs.
arXiv Detail & Related papers (2025-07-01T07:06:32Z)
Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings [14.065907685322097]
We present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs). We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues.
arXiv Detail & Related papers (2025-06-11T06:12:50Z)
Towards Understanding Graphical Perception in Large Multimodal Models [80.44471730672801]
We leverage the theory of graphical perception to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three levels (chart, visual element, and pixel).
arXiv Detail & Related papers (2025-03-13T20:13:39Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models [10.828419851213528]
We propose the Multi-Dimensional Insights benchmark, which includes over 500 images covering six common scenarios of human life. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs.
arXiv Detail & Related papers (2024-12-17T07:06:10Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
LLaVA-Critic: Learning to Evaluate Multimodal Models [110.06665155812162]
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios.
arXiv Detail & Related papers (2024-10-03T17:36:33Z)
Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks. Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs. We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study [32.57246173437492]
This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses. We conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO. Notably, the enhanced LLaVA-1.5 outperforms its original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score.
arXiv Detail & Related papers (2024-01-31T16:38:32Z)
Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs). We first build a vision-language feedback dataset utilizing AI annotation. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [81.2935966933355]
We study the impact of visual adversarial attacks on Large Multimodal Models (LMMs). We find that in general LMMs are not robust to visual adversarial inputs. We propose a new approach to real-world image classification which we term query decomposition.
arXiv Detail & Related papers (2023-12-06T04:59:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.