Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
- URL: http://arxiv.org/abs/2405.02917v1
- Date: Sun, 5 May 2024 12:51:38 GMT
- Title: Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
- Authors: Tobias Groot, Matias Valdenegro-Toro,
- Abstract summary: Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI by their ability to generate human-like text and understand images, but ensuring their reliability is crucial.
This paper aims to evaluate the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting.
We propose the new Japanese Uncertain Scenes dataset aimed at testing VLM capabilities via difficult queries and object counting, and the Net Error dataset to measure direction of miscalibration.
- Score: 6.9060054915724
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI by their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper aims to evaluate the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure direction of miscalibration. Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation and 95% confidence intervals.
Related papers
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Calibrating Large Language Models Using Their Generations Only [44.26441565763495]
APRICOT is a method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone.
It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages.
We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.
arXiv Detail & Related papers (2024-03-09T17:46:24Z) - Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z) - Uncertainty-Aware Evaluation for Vision-Language Models [0.0]
Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between model uncertainty and its language model part.
arXiv Detail & Related papers (2024-02-22T10:04:17Z) - Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection [29.138463029748547]
This paper explores the capability of Large Language Models to detect implicit hate speech and express confidence in their responses.
Our findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech.
arXiv Detail & Related papers (2024-02-18T00:04:40Z) - The Calibration Gap between Model and Human Confidence in Large Language
Models [14.539888672603743]
Large language models (LLMs) need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct.
Recent work has focused on the quality of internal LLM confidence assessments.
This paper explores the disparity between external human confidence in an LLM's responses and the internal confidence of the model.
arXiv Detail & Related papers (2024-01-24T22:21:04Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - A Survey of Confidence Estimation and Calibration in Large Language Models [86.692994151323]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains.
Despite their impressive performance, they can be unreliable due to factual errors in their generations.
Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations.
arXiv Detail & Related papers (2023-11-14T16:43:29Z) - Embers of Autoregression: Understanding Large Language Models Through
the Problem They are Trained to Solve [21.55766758950951]
We make predictions about the strategies that large language models will adopt to solve next-word prediction tasks.
We evaluate two LLMs on eleven tasks and find robust evidence that LLMs are influenced by probability.
We conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system.
arXiv Detail & Related papers (2023-09-24T13:35:28Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.