Uncertainty-Aware Evaluation for Vision-Language Models
- URL: http://arxiv.org/abs/2402.14418v2
- Date: Sat, 24 Feb 2024 12:30:40 GMT
- Title: Uncertainty-Aware Evaluation for Vision-Language Models
- Authors: Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, Eugene Ilyushin
- Abstract summary: Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between a model's uncertainty and its language-model component.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models like GPT-4, LLaVA, and CogVLM have surged in
popularity recently due to their impressive performance in several
vision-language tasks. Current evaluation methods, however, overlook an
essential component: uncertainty, which is crucial for a comprehensive
assessment of VLMs. Addressing this oversight, we present a benchmark
incorporating uncertainty quantification into the evaluation of VLMs.
Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question
Answering (VQA) task. We examine models on 5 datasets that evaluate various
vision-language capabilities.
Using conformal prediction as an uncertainty estimation approach, we
demonstrate that the models' uncertainty is not aligned with their accuracy.
Specifically, we show that the models with the highest accuracy may also have the
highest uncertainty, which confirms the importance of measuring uncertainty for
VLMs. Our empirical findings also reveal a correlation between a model's
uncertainty and its language-model component.
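The abstract does not spell out the exact conformal procedure, but a standard choice for multiple-choice tasks is split conformal prediction over the model's per-option softmax scores, where larger prediction sets signal higher uncertainty. Below is a minimal sketch under that assumption; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: nonconformity = 1 - p(correct option).

    cal_probs:  (n, k) softmax scores over the k answer options
    cal_labels: (n,)   index of the correct option per question
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level for coverage 1 - alpha.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, qhat):
    """Keep every option whose nonconformity (1 - p) is at most qhat."""
    return test_probs >= 1.0 - qhat  # boolean mask of shape (m, k)

# Illustrative usage with random stand-ins for real VLM option scores.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)   # 4-way multiple choice
cal_labels = rng.integers(0, 4, size=500)
qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.1)

test_probs = rng.dirichlet(np.ones(4), size=100)
sets = prediction_sets(test_probs, qhat)
print("mean prediction-set size:", sets.sum(axis=1).mean())
```

By construction, such sets cover the true answer with probability roughly 1 - alpha, so the mean set size is an uncertainty measure that can diverge from accuracy, which is the mismatch the paper highlights.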
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
Large Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial.
This paper evaluates the ability of LLMs (GPT-4, GPT-3.5, LLaMA 2, and PaLM 2) and VLMs (GPT-4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting; a minimal sketch of this style of confidence elicitation appears after this list.
We propose the new Japanese Uncertain Scenes dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Error dataset, to measure the direction of miscalibration.
arXiv Detail & Related papers (2024-05-05T12:51:38Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models
This work focuses on improving the confidence estimation of large language models.
Considering the fragility of self-awareness in language models, we introduce a Multi-Perspective Consistency (MPC) method.
The experimental results on eight publicly available datasets show that our MPC achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-17T13:37:39Z)
- Uncertainty-aware Language Modeling for Selective Question Answering
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- MMBench: Is Your Multi-modal Model an All-around Player?
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
VALSE (Vision And Language Structured Evaluation) is a novel benchmark for testing general-purpose pretrained vision and language (V&L) models.
VALSE offers a suite of six tests covering various linguistic constructs.
We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models.
arXiv Detail & Related papers (2021-12-14T17:15:04Z)
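As flagged in the Overconfidence is Key entry above, verbalized uncertainty is usually elicited by prompting the model to state a confidence alongside its answer and then comparing stated confidences with observed accuracy. A minimal sketch, assuming an OpenAI-compatible chat API; the prompt wording, model name, and parsing are illustrative, not the paper's protocol.

```python
import re
from openai import OpenAI  # assumes the openai>=1.0 client with an API key in the environment

client = OpenAI()

# Illustrative elicitation prompt; the paper's exact wording is not reproduced here.
PROMPT = (
    "In which year did the first crewed Moon landing take place? "
    "Answer, then state your confidence as a percentage, "
    "formatted exactly as 'Confidence: NN%'."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; for VLMs, image content would be attached to the message
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content

# Parse the verbalized confidence; calibration analysis compares it with accuracy.
match = re.search(r"Confidence:\s*(\d{1,3})%", text)
confidence = int(match.group(1)) / 100 if match else None
print(text)
print("parsed confidence:", confidence)
```

Aggregating the gap between stated confidence and empirical accuracy over many such queries yields a signed miscalibration measure, in the spirit of that entry's Net Error dataset.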
This list is automatically generated from the titles and abstracts of the papers on this site.