Uncertainty-Aware Evaluation for Vision-Language Models
- URL: http://arxiv.org/abs/2402.14418v2
- Date: Sat, 24 Feb 2024 12:30:40 GMT
- Title: Uncertainty-Aware Evaluation for Vision-Language Models
- Authors: Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, Eugene Ilyushin
- Abstract summary: Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between a VLM's uncertainty and the language model it is built on.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models like GPT-4, LLaVA, and CogVLM have surged in
popularity recently due to their impressive performance in several
vision-language tasks. Current evaluation methods, however, overlook an
essential component: uncertainty, which is crucial for a comprehensive
assessment of VLMs. Addressing this oversight, we present a benchmark
incorporating uncertainty quantification into evaluating VLMs.
Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question
Answering (VQA) task. We examine models on 5 datasets that evaluate various
vision-language capabilities.
Using conformal prediction as an uncertainty estimation approach, we
demonstrate that the models' uncertainty is not aligned with their accuracy.
Specifically, we show that models with the highest accuracy may also have the
highest uncertainty, which underscores the importance of measuring uncertainty for VLMs.
Our empirical findings also reveal a correlation between a VLM's uncertainty and
the language model it is built on.
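To make the evaluation setup concrete, here is a minimal sketch of split conformal prediction applied to multiple-choice VQA, assuming access to the model's per-option softmax scores. The function names and the nonconformity score (one minus the probability assigned to the correct option) are illustrative choices, not necessarily the exact recipe used in the paper.

```python
import numpy as np

def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration on a held-out VQA calibration set.

    cal_probs:  (n, k) softmax scores over the k answer options
    cal_labels: (n,)   index of the correct option
    Returns the conformal threshold q_hat for miscoverage level alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level (clipped to 1 for small n).
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, q_hat):
    """Options whose nonconformity score falls below the calibrated threshold."""
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

# Toy usage with random scores; in practice these would be the VLM's
# per-option probabilities on the multiple-choice VQA benchmark.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)
cal_labels = rng.integers(0, 4, size=500)
q_hat = calibrate(cal_probs, cal_labels, alpha=0.1)
sets = prediction_sets(rng.dirichlet(np.ones(4), size=100), q_hat)
print("average prediction-set size:", np.mean([len(s) for s in sets]))
```

Under this scheme, the average size of the prediction sets serves as an uncertainty measure: a model can be accurate on average yet still produce large sets, which is the accuracy/uncertainty mismatch the abstract describes.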
Related papers
- To Trust Or Not To Trust Your Vision-Language Model's Prediction [37.90196640800147]
We introduce TrustVLM, a training-free framework designed to address the challenge of estimating when a VLM's predictions can be trusted.
Motivated by the observed modality gap in VLMs, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection.
We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-05-29T17:59:01Z)
- Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models [15.158475816860427]
Uncertainty is essential for assessing the reliability and trustworthiness of modern AI systems.
Verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution.
However, its effectiveness in vision-language models (VLMs) remains insufficiently studied.
arXiv Detail & Related papers (2025-05-26T17:16:36Z)
- Post-hoc Probabilistic Vision-Language Models [51.12284891724463]
Vision-language models (VLMs) have found remarkable success in classification, retrieval, and generative tasks.
We propose post-hoc uncertainty estimation in VLMs that does not require additional training.
Our results show promise for safety-critical applications of large-scale models.
arXiv Detail & Related papers (2024-12-08T18:16:13Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models [6.9060054915724]
Large Language Models and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial.
This paper aims to evaluate the ability of LLMs (GPT-4, GPT-3.5, LLaMA 2, and PaLM 2) and VLMs (GPT-4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting.
We propose the new Japanese Uncertain Scenes dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Error dataset to measure the direction of miscalibration.
arXiv Detail & Related papers (2024-05-05T12:51:38Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models [27.63938857490995]
This work focuses on improving the confidence estimation of large language models.
Considering the fragility of self-awareness in language models, we introduce a Multi-Perspective Consistency (MPC) method.
The experimental results on eight publicly available datasets show that our MPC achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-17T13:37:39Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena [15.984927623688915]
VALSE (Vision And Language Structured Evaluation) is a novel benchmark for testing general-purpose pretrained vision and language (V&L) models.
VALSE offers a suite of six tests covering various linguistic constructs.
We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models.
arXiv Detail & Related papers (2021-12-14T17:15:04Z)