Prompting is not a substitute for probability measurements in large
language models
- URL: http://arxiv.org/abs/2305.13264v2
- Date: Mon, 23 Oct 2023 14:12:59 GMT
- Title: Prompting is not a substitute for probability measurements in large
language models
- Authors: Jennifer Hu and Roger Levy
- Abstract summary: We compare metalinguistic prompting and direct probability measurements as ways of measuring models' linguistic knowledge.
Our findings suggest that negative results relying on metalinguistic prompts cannot be taken as conclusive evidence that an LLM lacks a particular linguistic generalization.
Our results also highlight the value that is lost with the move to closed APIs where access to probability distributions is limited.
- Score: 22.790531588072245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompting is now a dominant method for evaluating the linguistic knowledge of
large language models (LLMs). While other methods directly read out models'
probability distributions over strings, prompting requires models to access
this internal information by processing linguistic input, thereby implicitly
testing a new type of emergent ability: metalinguistic judgment. In this study,
we compare metalinguistic prompting and direct probability measurements as ways
of measuring models' linguistic knowledge. Broadly, we find that LLMs'
metalinguistic judgments are inferior to quantities directly derived from
representations. Furthermore, consistency gets worse as the prompt query
diverges from direct measurements of next-word probabilities. Our findings
suggest that negative results relying on metalinguistic prompts cannot be taken
as conclusive evidence that an LLM lacks a particular linguistic
generalization. Our results also highlight the value that is lost with the move
to closed APIs where access to probability distributions is limited.
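A minimal sketch of the contrast described in the abstract, assuming a Hugging Face causal LM (gpt2 here) and an invented subject-verb agreement minimal pair; this is illustrative only, not the paper's exact experimental protocol. Part (1) reads out a direct probability measurement by summing token log-probabilities for each string; part (2) poses a metalinguistic prompt and reads out which answer label the model prefers.
```python
# Minimal sketch (not the authors' exact setup): direct probability measurement
# vs. metalinguistic prompting, using a Hugging Face causal LM and a made-up
# minimal pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Direct measurement: total log P(sentence) under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its prefix (shift by one position).
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_logprobs = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."

# (1) Direct probability measurement: compare string log-probabilities.
print("direct:", sentence_logprob(good) > sentence_logprob(bad))

# (2) Metalinguistic prompting: ask for a judgment, then read out which answer
#     label ("1" or "2") receives higher probability at the next position.
prompt = (
    "Which sentence is more grammatically acceptable?\n"
    f"1: {good}\n"
    f"2: {bad}\n"
    "Answer:"
)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = model(prompt_ids).logits[0, -1]
label_1 = tokenizer(" 1").input_ids[0]
label_2 = tokenizer(" 2").input_ids[0]
print("prompted:", "1" if next_token_logits[label_1] > next_token_logits[label_2] else "2")
```
The paper's finding is that answers from (2) can disagree with (1), and that such disagreement reflects the limits of metalinguistic judgment rather than an absence of the underlying linguistic generalization.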
Related papers
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- How to Compute the Probability of a Word [45.23856093235994]
This paper derives the correct methods for computing word probabilities.
We show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
arXiv Detail & Related papers (2024-06-20T17:59:42Z)
- What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the complexity of the RLM rank is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainty, aleatoric and epistemic.
The proposed method offers an unsupervised way to understand the predictions of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)
- A novel approach to measuring the scope of patent claims based on probabilities obtained from (large) language models [0.0]
This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in the claim.
The more surprising the information required to define the claim, the narrower its scope (a minimal numeric sketch appears after this list).
arXiv Detail & Related papers (2023-09-17T16:50:07Z)
- Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure of semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
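As referenced in the patent-claims entry above, here is a minimal numeric sketch of its scope measure, assuming scope is taken as the reciprocal of self-information, -log2 P(claim), with P(claim) supplied by a language model; the claim probabilities below are made-up placeholders, not results from that paper.
```python
# Minimal sketch of the scope measure from the patent-claims entry above:
# scope = 1 / self-information, where self-information = -log2 P(claim).
# The claim probabilities here are invented placeholders, not paper results.
import math

def claim_scope(claim_probability: float) -> float:
    """More surprising (lower-probability) claims carry more bits of
    self-information and therefore receive a smaller (narrower) scope."""
    self_information_bits = -math.log2(claim_probability)
    return 1.0 / self_information_bits

print(claim_scope(1e-12))  # broader claim  -> ~0.025
print(claim_scope(1e-40))  # narrower claim -> ~0.008
```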