Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective
- URL: http://arxiv.org/abs/2506.16288v1
- Date: Thu, 19 Jun 2025 13:05:12 GMT
- Title: Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective
- Authors: Leo Gagnon, Eric Elmoznino, Sarthak Mittal, Tom Marty, Tejas Kasetty, Dhanya Sridhar, Guillaume Lajoie
- Abstract summary: We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference.
- Score: 12.655285605773932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid adaptation ability of auto-regressive foundation models is often attributed to the diversity of their pre-training data. This is because, from a Bayesian standpoint, minimizing prediction error in such settings requires integrating over all plausible latent hypotheses consistent with observations. While this behavior is desirable in principle, it often proves too ambitious in practice: under high ambiguity, the number of plausible latent alternatives makes Bayes-optimal prediction computationally intractable. Cognitive science has long recognized this limitation, suggesting that under such conditions, heuristics or information-seeking strategies are preferable to exhaustive inference. Translating this insight to next-token prediction, we hypothesize that low- and high-ambiguity predictions pose different computational demands, making ambiguity-agnostic next-token prediction a detrimental inductive bias. To test this, we introduce MetaHMM, a synthetic sequence meta-learning benchmark with rich compositional structure and a tractable Bayesian oracle. We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Motivated by cognitive theories, we propose a method to convert pre-trained models into Monte Carlo predictors that decouple task inference from token prediction. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference, though challenges remain.
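The proposed Monte Carlo predictor can be illustrated with a toy sketch: infer a posterior over latent hypotheses from the observed prefix, sample hypotheses from that posterior, and average their next-token predictions. The two first-order Markov chains below stand in for MetaHMM's latent HMM tasks; all names, numbers, and structure are illustrative assumptions, not the authors' implementation.

```python
import random

# Toy "task library": each latent hypothesis is a first-order Markov chain
# over tokens {0, 1}, given as P(next=1 | prev). These stand in for the
# latent HMMs of the MetaHMM benchmark (illustrative, not the paper's code).
HYPOTHESES = {
    "sticky":    {0: 0.1, 1: 0.9},  # tends to repeat the previous token
    "alternate": {0: 0.9, 1: 0.1},  # tends to flip the previous token
}

def likelihood(seq, trans):
    """P(seq | hypothesis) for a first-order chain with a uniform start."""
    p = 0.5
    for prev, nxt in zip(seq, seq[1:]):
        p1 = trans[prev]
        p *= p1 if nxt == 1 else (1.0 - p1)
    return p

def posterior(seq):
    """Posterior over hypotheses given the observed prefix (uniform prior)."""
    scores = {h: likelihood(seq, t) for h, t in HYPOTHESES.items()}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def mc_next_token_prob(seq, k=1000, rng=None):
    """Monte Carlo estimate of P(next=1 | seq): sample a hypothesis from
    the posterior, predict under it, and average over k samples. This is
    the 'decouple task inference from token prediction' idea in miniature."""
    rng = rng or random.Random(0)
    post = posterior(seq)
    names = list(post)
    weights = [post[h] for h in names]
    total = 0.0
    for _ in range(k):
        h = rng.choices(names, weights=weights)[0]
        total += HYPOTHESES[h][seq[-1]]
    return total / k

# A prefix of repeated tokens makes the "sticky" chain dominate the posterior,
# so the Monte Carlo estimate lands close to 0.9.
print(mc_next_token_prob([0, 0, 0, 0, 1, 1, 1, 1]))
```

A full Bayes-optimal predictor would integrate over all hypotheses exactly; the point of the Monte Carlo variant is that sampling lets the same machinery scale its computation with ambiguity at test time.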
Related papers
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts [6.256239986541708]
We show that pre-trained sequence models are naturally capable of probabilistic reasoning over exchangeable data points. A sequence model learns the relationship between observations, which differs from typical Bayesian models. We show the sequence prediction loss controls the quality of uncertainty quantification.
arXiv Detail & Related papers (2024-08-06T17:16:10Z)
- Calibrated Large Language Models for Binary Question Answering [49.1574468325115]
A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct.
We propose a novel approach that utilizes the inductive Venn--Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels.
arXiv Detail & Related papers (2024-07-01T09:31:03Z)
- Movement-Prediction-Adjusted Naïve Forecast [6.935130578959931]
This study introduces a movement-prediction-adjusted naïve forecast for time series exhibiting symmetric random walk characteristics. The adjusted naïve forecast achieved statistically significant improvements even at relatively low directional accuracy levels. These findings imply that the movement-prediction-adjusted naïve forecast can serve as an effective second-stage method for forecasting symmetric random walk time series.
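The adjustment idea can be sketched minimally: shift the naïve forecast (the last observed value) in the predicted movement direction by the average absolute step size. The constant step size and the ±1 direction input are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of a movement-prediction-adjusted naive forecast.
# The plain naive forecast for a random walk is the last observed value;
# here it is nudged in the direction a movement classifier predicts.

def adjusted_naive_forecast(series, predicted_direction):
    """series: observed values; predicted_direction: +1 (up) or -1 (down).
    Returns the last value shifted by the mean absolute historical step."""
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    avg_step = sum(steps) / len(steps)
    return series[-1] + predicted_direction * avg_step
```

Because the adjustment magnitude is fixed, even a direction predictor that is only modestly better than chance can improve on the unadjusted naïve forecast on average, which matches the blurb's claim about low directional accuracy levels.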
arXiv Detail & Related papers (2024-06-20T16:32:18Z)
- Finding Islands of Predictability in Action Forecasting [7.215559809521136]
We show that future action sequences are more accurately modeled at variable, rather than single, levels of abstraction.
We propose a combination Bayesian neural network and hierarchical convolutional segmentation model to both accurately predict future actions and optimally select abstraction levels.
arXiv Detail & Related papers (2022-10-13T21:01:16Z)
- Predictive Inference with Feature Conformal Prediction [80.77443423828315]
We propose feature conformal prediction, which extends the scope of conformal prediction to semantic feature spaces.
From a theoretical perspective, we demonstrate that feature conformal prediction provably outperforms regular conformal prediction under mild assumptions.
Our approach could be combined with not only vanilla conformal prediction, but also other adaptive conformal prediction methods.
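For contrast with the feature-space variant, vanilla split conformal prediction, the baseline this work extends, can be sketched in a few lines (an illustrative sketch, not the paper's code):

```python
import math

def conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Split conformal prediction: use held-out calibration residuals to
    build a prediction interval with ~(1 - alpha) marginal coverage."""
    residuals = sorted(abs(p - y) for p, y in zip(cal_preds, cal_targets))
    n = len(residuals)
    # Conservative finite-sample quantile rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    q = residuals[min(rank, n) - 1]
    return (test_pred - q, test_pred + q)
```

Feature conformal prediction applies the same recipe but measures nonconformity in a learned semantic feature space rather than in output space.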
arXiv Detail & Related papers (2022-10-01T02:57:37Z)
- Uncertainty estimation of pedestrian future trajectory using Bayesian approximation [137.00426219455116]
Under dynamic traffic scenarios, planning based on deterministic predictions is not trustworthy.
The authors propose quantifying uncertainty during forecasting via Bayesian approximation, capturing variability that deterministic approaches miss.
The effect of dropout weights and long-term prediction on future state uncertainty has been studied.
arXiv Detail & Related papers (2022-05-04T04:23:38Z)
- Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets.
We observe that trustworthiness predictors trained with prior-art loss functions are prone to view both correct and incorrect predictions as trustworthy.
We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z)
- Set Prediction without Imposing Structure as Conditional Density Estimation [40.86881969839325]
We propose an alternative to training via set losses by viewing learning as conditional density estimation.
Our framework fits deep energy-based models and approximates the intractable likelihood with gradient-guided sampling.
Our approach is competitive with previous set prediction models on standard benchmarks.
arXiv Detail & Related papers (2020-10-08T16:49:16Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.