Evaluating language models as risk scores
- URL: http://arxiv.org/abs/2407.14614v3
- Date: Mon, 23 Sep 2024 10:46:48 GMT
- Title: Evaluating language models as risk scores
- Authors: André F. Cruz, Moritz Hardt, Celestine Mendler-Dünner,
- Abstract summary: We introduce folktexts, a software package to generate risk scores using question-answering LLMs.
We evaluate 17 recent LLMs across five proposed benchmark tasks.
We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.
- Score: 23.779329697527054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes answer distribution regardless of true underlying data uncertainty. This reveals a general inability of instruction-tuned LLMs to express data uncertainty using multiple-choice answers. A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models. These differences in ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind-spot in the current evaluation ecosystem that folktexts covers.
Related papers
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
arXiv Detail & Related papers (2024-03-07T17:44:17Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Quantifying Uncertainty in Answers from any Language Model and Enhancing
their Trustworthiness [16.35655151252159]
We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model.
Our uncertainty quantification technique works for any LLM accessible only via a black-box API.
arXiv Detail & Related papers (2023-08-30T17:53:25Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.