Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal
- URL: http://arxiv.org/abs/2601.09886v1
- Date: Wed, 14 Jan 2026 21:38:54 GMT
- Title: Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal
- Authors: Sathvik Nair, Byung-Doh Oh
- Abstract summary: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). We present evidence for three hypotheses about the advantage of LM probabilities.
- Score: 7.591490481106253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.
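To make the central quantity concrete, here is a minimal sketch of how token-level LM surprisal (-log2 p) can be computed. The choice of GPT-2 and the example sentence are illustrative assumptions, not the paper's exact setup:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative model choice; the paper's actual LMs are not specified here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The children went outside to play."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

# Logits at position i predict token i+1, so shift targets by one.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
target_ids = ids[0, 1:]
token_log_probs = log_probs[torch.arange(target_ids.numel()), target_ids]
surprisal_bits = -token_log_probs / math.log(2)

for tok, s in zip(tokenizer.convert_ids_to_tokens(target_ids.tolist()), surprisal_bits):
    print(f"{tok:>12}  {s.item():6.2f} bits")
```

Cloze surprisal would instead be -log2 of the fraction of participants who produced the word. With N respondents the smallest nonzero probability a word can receive is 1/N, and unattested words get zero; this floor is the resolution limitation the abstract refers to, which LM probabilities do not share.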
Related papers
- Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models [13.41454380481593]
Uncertainty estimation, often based on predictive entropy, is key to judging the reliability of LLM outputs. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses' top-$K$ probabilities.
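As a rough illustration of the idea (a sketch, not the paper's exact estimator), predictive entropy can be approximated from only the top-K probabilities after renormalizing them:

```python
import math

def top_k_entropy(probs, k=10):
    """Approximate Shannon entropy using only the top-k probabilities,
    renormalized to sum to 1. A sketch of the general idea; the paper's
    exact estimator may differ."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum(p / total * math.log(p / total) for p in top if p > 0)

print(top_k_entropy([0.90, 0.05, 0.03, 0.02]))  # peaked: low entropy
print(top_k_entropy([0.25, 0.25, 0.25, 0.25]))  # flat: high entropy (log 4)
```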
arXiv Detail & Related papers (2025-11-10T23:31:43Z)
- Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs [47.20307724127832]
We present the first comprehensive study of the probabilistic reasoning capabilities of large language models (LLMs). We evaluate models on three carefully designed tasks: mode identification, maximum likelihood estimation, and sample generation. Through empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models.
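The three task types can be illustrated on a toy categorical distribution (a hypothetical example; the paper's prompts and distributions are not reproduced here):

```python
import random
from collections import Counter

random.seed(0)
samples = [random.choice("AABBC") for _ in range(1000)]  # skewed toy data

counts = Counter(samples)
mode = counts.most_common(1)[0][0]                      # mode identification
mle = {c: n / len(samples) for c, n in counts.items()}  # maximum likelihood estimates
draws = random.choices(list(mle), weights=list(mle.values()), k=5)  # sample generation
print(mode, mle, draws)
```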
arXiv Detail & Related papers (2025-09-12T22:58:05Z)
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Probabilistic Medical Predictions of Large Language Models [4.825666689707888]
Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering. However, LLMs struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token.
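A minimal sketch of the "implicit" side of that comparison, assuming a Hugging Face causal LM and an illustrative yes/no prompt (both are assumptions, not the paper's clinical setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: Should the patient be scheduled for follow-up? Answer:"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]

probs = torch.softmax(next_token_logits, dim=-1)
p_yes = probs[tok.encode(" Yes")[0]].item()
p_no = probs[tok.encode(" No")[0]].item()

# Implicit probability: likelihood mass on the label tokens, renormalized.
print("P(Yes) =", p_yes / (p_yes + p_no))
```

The "explicit" alternative would instead parse a verbalized confidence (e.g., "80%") out of the model's generated text.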
arXiv Detail & Related papers (2024-08-21T03:47:17Z)
- Calibrated Large Language Models for Binary Question Answering [49.1574468325115]
A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct.
We propose a novel approach that utilizes the inductive Venn-Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels.
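A compact sketch of the Venn-Abers idea (a simplified version, not necessarily the paper's exact IVAP implementation): refit isotonic calibration twice, once per hypothesized label of the test point, yielding a probability interval.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): calibrated probabilities obtained by adding the
    test point to the calibration set with assumed label 0, then 1."""
    probs = []
    for assumed_label in (0.0, 1.0):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(np.append(cal_scores, test_score),
                np.append(cal_labels, assumed_label))
        probs.append(float(iso.predict([test_score])[0]))
    return min(probs), max(probs)

# Toy usage: raw model scores plus binary gold labels for calibration.
p0, p1 = venn_abers_interval(
    cal_scores=[0.1, 0.3, 0.4, 0.7, 0.8, 0.9],
    cal_labels=[0, 0, 1, 0, 1, 1],
    test_score=0.65,
)
print(p0, p1)
```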
arXiv Detail & Related papers (2024-07-01T09:31:03Z)
- To Believe or Not to Believe Your LLM [51.2579827761899]
We explore uncertainty quantification in large language models (LLMs).
We derive an information-theoretic metric that makes it possible to reliably detect when only epistemic uncertainty is large.
We conduct a series of experiments which demonstrate the advantage of our formulation.
arXiv Detail & Related papers (2024-06-04T17:58:18Z)
- Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment [36.82878715850013]
Merrill et al. argue that, in theory, sentence co-occurrence probabilities predicted by an optimal LM should reflect the entailment relationship of the constituent sentences.
We investigate whether their theory can be used to decode entailment relations from neural LMs.
We find that a test similar to theirs can decode entailment relations between natural sentences, well above random chance, though not perfectly.
arXiv Detail & Related papers (2024-02-21T17:36:07Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
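A schematic sketch of that pipeline; `ask_llm` is a hypothetical stand-in for a real model call, and aggregation by answer frequency is one simple ensembling choice:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; canned answers keep the
    sketch runnable."""
    return "Paris" if "France" in prompt else "unsure"

def clarification_ensemble(question: str, clarifications: list[str]) -> dict[str, float]:
    """Pose each clarified version of an ambiguous input, then aggregate.
    Disagreement across clarifications signals ambiguity in the input
    rather than in the model."""
    answers = [ask_llm(f"{c} {question}") for c in clarifications]
    counts = Counter(answers)
    return {a: n / len(answers) for a, n in counts.items()}

print(clarification_ensemble(
    "What is the capital?",
    ["Assume the question is about France.",
     "Assume the question is about a US state."],
))
```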
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
- Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real-world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) only in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
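For reference, a minimal sketch of the aggregate metric in question: perplexity is the exponentiated mean negative log-likelihood, so two models can score identically while differing sharply on the rare-event tail (the values below are hypothetical):

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), from natural-log token probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Two toy models with identical aggregate perplexity but very different
# behavior on the single rare token (last position): model A assigns it
# 0.2, model B only ~0.001, compensating on the frequent tokens.
model_a = [math.log(0.20)] * 9 + [math.log(0.20)]
model_b = [math.log(0.36)] * 9 + [math.log(0.20 ** 10 / 0.36 ** 9)]
print(perplexity(model_a), perplexity(model_b))  # equal scores
```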
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
- The Language Model Understood the Prompt was Ambiguous: Probing Syntactic Uncertainty Through Generation [23.711953448400514]
We inspect the extent to which neural language models (LMs) exhibit uncertainty over competing syntactic analyses of ambiguous input.
We find that LMs can track multiple analyses simultaneously.
As a response to disambiguating cues, the LMs often select the correct interpretation, but occasional errors point to potential areas of improvement.
arXiv Detail & Related papers (2021-09-16T10:27:05Z)