Evaluating language models as risk scores
- URL: http://arxiv.org/abs/2407.14614v1
- Date: Fri, 19 Jul 2024 18:13:37 GMT
- Title: Evaluating language models as risk scores
- Authors: André F. Cruz, Moritz Hardt, Celestine Mendler-Dünner
- Abstract summary: We focus on the use of language models as risk scores for unrealizable prediction tasks.
We introduce folktexts, a software package to systematically generate risk scores using large language models.
We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models.
- Score: 23.779329697527054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.
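As a rough illustration of the idea in the abstract, the sketch below turns the log-probabilities of two answer tokens into a risk score and measures miscalibration with a binned expected calibration error. This is not the folktexts API: the function names `risk_score` and `expected_calibration_error`, the two-token "Yes"/"No" setup, and the equal-width binning are all assumptions made for this example.

```python
import math


def risk_score(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize the probability mass on two answer tokens into a score in [0, 1].

    Hypothetical helper: assumes the model exposes log-probabilities for a
    positive ("Yes") and a negative ("No") answer token.
    """
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)


def expected_calibration_error(scores, labels, n_bins=10):
    """Equal-width binned ECE: size-weighted mean of |mean score - positive rate|."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        bins[idx].append((s, y))

    n = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_score = sum(s for s, _ in b) / len(b)
        positive_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(mean_score - positive_rate)
    return ece


if __name__ == "__main__":
    # Over-confident scores on a task whose true positive rate is 50%.
    overconfident = [0.99, 0.99, 0.99, 0.99]
    outcomes = [1, 1, 0, 0]
    print(expected_calibration_error(overconfident, outcomes))
```

Under these assumptions, a model that always reports a score near 0.99 on outcomes that are positive only half of the time receives a large calibration error, which is the kind of over-confidence the abstract attributes to instruction-tuned models.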
Related papers
- VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation [16.889939234103153]
We propose to variabilize benchmarks and evaluate language models dynamically.
Specifically, we extract variables from each test case and define a value range for each variable.
For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time.
arXiv Detail & Related papers (2024-06-25T16:13:53Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- CUE: An Uncertainty Interpretation Framework for Text Classifiers Built on Pre-Trained Language Models [28.750894873827068]
We propose a novel framework, called CUE, which aims to interpret uncertainties inherent in the predictions of PLM-based models.
By comparing the difference in predictive uncertainty between the perturbed and the original text representations, we are able to identify the latent dimensions responsible for uncertainty.
arXiv Detail & Related papers (2023-06-06T11:37:46Z)
- Conformal Prediction with Large Language Models for Multi-Choice Question Answering [7.049780432343948]
We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy.
This work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations.
arXiv Detail & Related papers (2023-05-28T15:26:10Z)
- Can Prompt Probe Pretrained Language Models? Understanding the Invisible Risks from a Causal View [37.625078897220305]
Prompt-based probing has been widely used in evaluating the abilities of pretrained language models (PLMs).
This paper investigates the prompt-based probing from a causal view, highlights three critical biases which could induce biased results and conclusions, and proposes to conduct debiasing via causal intervention.
arXiv Detail & Related papers (2022-03-23T08:10:07Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.