An Evaluation of Estimative Uncertainty in Large Language Models
- URL: http://arxiv.org/abs/2405.15185v1
- Date: Fri, 24 May 2024 03:39:31 GMT
- Title: An Evaluation of Estimative Uncertainty in Large Language Models
- Authors: Zhisheng Tang, Ke Shen, Mayank Kejriwal
- Abstract summary: Estimative uncertainty has long been an area of study -- including by intelligence agencies like the CIA.
This study compares estimative uncertainty in commonly used large language models (LLMs) to that of humans, and to each other.
We show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English.
- Score: 3.04503073434724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, compared with direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study -- including by intelligence agencies like the CIA. This study compares estimative uncertainty in commonly used large language models (LLMs) like GPT-4 and ERNIE-4 to that of humans, and to each other. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLM is presented with gendered roles and Chinese contexts. Further study shows that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, but a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.
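As a rough illustration of the elicitation setup the abstract describes, the sketch below asks any text-completion callable to map a WEP to a number and compares the result with human baseline values. The baseline numbers, the prompt wording, and the `ask_llm` callable are illustrative assumptions, not the paper's materials.

```python
import re
from statistics import mean

# Illustrative human baseline mappings in the spirit of Kent-style WEP scales.
# These are assumed values for demonstration, not the paper's survey numbers.
HUMAN_BASELINE = {
    "almost certainly": 0.93,
    "probably": 0.75,
    "maybe": 0.50,
    "probably not": 0.25,
    "almost no chance": 0.05,
}

PROMPT = (
    "On a scale from 0 to 100, what numerical probability best matches the "
    "phrase '{wep}'? Reply with a single number."
)

def elicit_probability(ask_llm, wep, n_samples=5):
    """Query an LLM (any callable: prompt -> text) several times and
    average the parsed numeric answers."""
    values = []
    for _ in range(n_samples):
        reply = ask_llm(PROMPT.format(wep=wep))
        match = re.search(r"\d+(\.\d+)?", reply)
        if match:
            values.append(float(match.group()) / 100.0)
    return mean(values) if values else None

def alignment_gap(ask_llm):
    """Absolute gap between LLM and human-baseline estimates for each WEP."""
    gaps = {}
    for wep, human_p in HUMAN_BASELINE.items():
        llm_p = elicit_probability(ask_llm, wep)
        if llm_p is not None:
            gaps[wep] = abs(llm_p - human_p)
    return gaps

# Usage (hypothetical model client): alignment_gap(lambda prompt: my_model.complete(prompt))
```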
Related papers
- Better Estimation of the KL Divergence Between Language Models [58.7977683502207]
Estimating the Kullback--Leibler (KL) divergence between language models has many applications.
We introduce a Rao--Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator.
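That summary contrasts the standard Monte Carlo estimator with a Rao-Blackwellized one. Below is a minimal sketch of the two, assuming access to per-token log-probabilities from both models; tensor shapes and function names are illustrative, not the paper's implementation.

```python
import torch

def mc_kl(logp_seqs, logq_seqs):
    """Standard Monte Carlo estimate of KL(p || q) from sequences sampled from p.
    logp_seqs, logq_seqs: sequence-level log-probabilities, shape [num_seqs]."""
    return (logp_seqs - logq_seqs).mean()

def rao_blackwell_kl(logp_steps, logq_steps):
    """Rao-Blackwellized estimate: for each prefix sampled from p, take the exact
    per-step KL over the full vocabulary rather than the single sampled token.
    logp_steps, logq_steps: per-step log-probabilities over the vocabulary at
    prefixes sampled from p, shape [num_seqs, seq_len, vocab_size]."""
    p = logp_steps.exp()
    step_kl = (p * (logp_steps - logq_steps)).sum(dim=-1)  # exact KL at each step
    return step_kl.sum(dim=-1).mean()  # sum over steps, average over sequences
```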
arXiv Detail & Related papers (2025-04-14T18:40:02Z)
- Can Language Models Learn Typologically Implausible Languages? [62.823015163987996]
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans.
We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals.
We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z)
- Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models [36.983534612895156]
In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks.
This paper focuses on five different NLI benchmarks across six models of different scales.
We investigate whether they can discriminate between models of different sizes and quality, and how their accuracies develop during training.
arXiv Detail & Related papers (2024-11-21T13:09:36Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
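A simplified sketch of that recipe, assuming feature vectors for model answers (e.g., pooled hidden states or token log-probabilities) and correctness labels are already available. The paper fine-tunes the LLM itself; this stand-in fits a lightweight external probe instead.

```python
from sklearn.linear_model import LogisticRegression

def fit_correctness_probe(features, is_correct):
    """Fit a probe that maps answer features to a probability of correctness.
    features: array of shape [n_answers, n_features]; is_correct: 0/1 labels."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, is_correct)
    return probe

def uncertainty(probe, features):
    """Predicted probability that each answer is wrong, i.e. 1 - P(correct)."""
    return 1.0 - probe.predict_proba(features)[:, 1]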
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
- Are Large Language Models Good Statisticians? [10.42853117200315]
StatQA is a new benchmark designed for statistical analysis tasks.
We show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%.
While open-source LLMs show limited capability, fine-tuned variants exhibit marked improvements.
arXiv Detail & Related papers (2024-06-12T02:23:51Z)
- Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed Rank-Calibration, to assess uncertainty and confidence measures for LMs.
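As a rough stand-in for the underlying idea (not the paper's exact rank-calibration error), one can check whether a confidence measure and a quality score agree in rank across responses:

```python
def ranks(values):
    """1-based ranks of values in ascending order; ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def rank_agreement(confidence, quality):
    """Spearman-style rank correlation between a confidence measure and a
    quality score: values near 1 mean higher confidence tracks higher quality."""
    rc, rq = ranks(confidence), ranks(quality)
    n = len(rc)
    mean_rc, mean_rq = sum(rc) / n, sum(rq) / n
    cov = sum((a - mean_rc) * (b - mean_rq) for a, b in zip(rc, rq))
    sd_c = sum((a - mean_rc) ** 2 for a in rc) ** 0.5
    sd_q = sum((b - mean_rq) ** 2 for b in rq) ** 0.5
    return cov / (sd_c * sd_q)
```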
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
- GPT-4 Surpassing Human Performance in Linguistic Pragmatics [0.0]
This study investigates the ability of Large Language Models (LLMs) to comprehend and interpret linguistic pragmatics.
Using Grice's communication principles, LLMs and human subjects were evaluated based on their responses to various dialogue-based tasks.
The findings revealed the superior performance and speed of LLMs, particularly GPT-4, over human subjects in interpreting pragmatics.
arXiv Detail & Related papers (2023-12-15T05:40:15Z)
- Language Models Hallucinate, but May Excel at Fact Verification [89.0833981569957]
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs.
Even GPT-3.5 produces factual outputs less than 25% of the time.
This underscores the importance of fact verifiers in order to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)
- Large Language Models are biased to overestimate profoundness [0.0]
This study evaluates GPT-4 and various other large language models (LLMs) in judging the profoundness of mundane, motivational, and pseudo-profound statements.
We found a significant statement-to-statement correlation between the LLMs and humans, irrespective of the type of statements and the prompting technique used.
arXiv Detail & Related papers (2023-10-22T21:33:50Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
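The entry above builds controlled artificial languages from generative models; as a much simpler illustration of the underlying concern (aggregate perplexity can mask behavior on the rare-event tail), the sketch below contrasts overall test perplexity with perplexity restricted to tokens that were rare in training. The function name, inputs, and frequency cutoff are illustrative assumptions.

```python
import math

def bucketed_perplexity(token_logprobs, tokens, train_counts, rare_cutoff=5):
    """Compare aggregate perplexity with perplexity on rare tokens only.

    token_logprobs: natural-log probabilities the model assigned to each test token.
    tokens: the corresponding test tokens.
    train_counts: mapping from token to its training-corpus frequency (assumed available).
    """
    overall = math.exp(-sum(token_logprobs) / len(token_logprobs))
    rare = [lp for lp, tok in zip(token_logprobs, tokens)
            if train_counts.get(tok, 0) < rare_cutoff]
    rare_ppl = math.exp(-sum(rare) / len(rare)) if rare else float("nan")
    return overall, rare_ppl
```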
This list is automatically generated from the titles and abstracts of the papers on this site.