Language Models Can Predict Their Own Behavior
- URL: http://arxiv.org/abs/2502.13329v2
- Date: Mon, 22 Sep 2025 18:41:38 GMT
- Title: Language Models Can Predict Their Own Behavior
- Authors: Dhananjay Ashok, Jonathan May
- Abstract summary: Language models (LMs) can exhibit specific 'behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. An early warning system built on the probes reduces jailbreaking by 91%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The text produced by language models (LMs) can exhibit specific 'behaviors,' such as a failure to follow alignment training, that we hope to detect and react to during deployment. Identifying these behaviors can often only be done post facto, i.e., after the entire text of the output has been generated. We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. The conformal probes can identify instances that will trigger alignment failures (jailbreaking) and instruction-following failures, without requiring a single token to be generated. An early warning system built on the probes reduces jailbreaking by 91%. Our probes also show promise in preemptively estimating how confident the model will be in its response, a behavior that cannot be detected using the output text alone. Conformal probes can preemptively estimate the final prediction of an LM that uses Chain-of-Thought (CoT) prompting, hence accelerating inference. When applied to an LM that uses CoT to perform text classification, the probes drastically reduce inference costs (65% on average across 27 datasets), with negligible accuracy loss. Encouragingly, probes generalize to unseen datasets and perform better on larger models, suggesting applicability to the largest of models in real-world settings.
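The recipe the abstract describes, a lightweight probe over input-token hidden states whose threshold is calibrated with split conformal prediction, can be sketched roughly as follows. All names, shapes, and the synthetic data are illustrative stand-ins, not the authors' code; real hidden states would come from a forward pass over the prompt.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for (pooled) hidden states of input tokens and binary
# behavior labels (e.g. 1 = the eventual output will be a jailbreak).
X = rng.normal(size=(600, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Split the data: train the probe, then calibrate it on held-out prompts.
X_tr, y_tr = X[:400], y[:400]
X_cal, y_cal = X[400:], y[400:]

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Split conformal calibration: nonconformity score = 1 - P(true label).
alpha = 0.1  # target miscoverage rate
scores = 1.0 - probe.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def predict_set(x):
    """Return every label whose score clears the conformal threshold."""
    p = probe.predict_proba(x.reshape(1, -1))[0]
    return [label for label in (0, 1) if 1.0 - p[label] <= q]
```

An early warning system would flag a prompt whenever `predict_set` contains the unsafe label, before any output token is generated; the conformal threshold `q` is what turns the raw probe into a predictor with a coverage guarantee.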
Related papers
- The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models [10.377264470934843]
Backdoor attacks on large language models typically couple a secret trigger to an explicit malicious output. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size.
arXiv Detail & Related papers (2025-11-16T02:01:58Z) - Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models [14.840508854268522]
Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs). We evaluate a range of monitoring methods using either CoT text or activations. We find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe.
arXiv Detail & Related papers (2025-07-16T17:16:03Z) - Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
arXiv Detail & Related papers (2025-07-16T16:27:50Z) - Semantic Probabilistic Control of Language Models [27.866585120174484]
Semantic control involves steering LM generations towards satisfying subtle non-lexical constraints. We leverage a verifier's information gradient to efficiently reason over all generations that satisfy the target attribute. We evaluate the effectiveness of our approach in controlling the toxicity, sentiment, and topic-adherence of LMs, yielding generations that satisfy the constraint with high probability.
arXiv Detail & Related papers (2025-05-04T01:21:28Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive.
LCD can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
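The entry above contrasts checking a constraint on the whole vocabulary with checking it only on tokens that are actually drawn. A toy version of such a rejection loop (illustrative only, not the paper's algorithm, which also adapts its weighting) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_sample_token(logits, satisfies, rng):
    """Sample a token that satisfies a constraint, evaluating the
    constraint only on tokens actually drawn, never on the full
    vocabulary. Returns (token, weight), where weight is the fraction
    of probability mass not yet proven to violate the constraint,
    usable as an importance weight."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    live = probs.copy()
    rejected_mass = 0.0
    while True:
        tok = rng.choice(len(live), p=live / live.sum())
        if satisfies(tok):
            return tok, 1.0 - rejected_mass
        rejected_mass += live[tok]
        live[tok] = 0.0  # never redraw a token known to violate

# Toy example: 10-token vocabulary, constraint = even token ids only.
logits = rng.normal(size=10)
tok, w = rejection_sample_token(logits, lambda t: t % 2 == 0, rng)
```

The savings come from the fact that, when most of the probability mass already satisfies the constraint, the loop usually terminates after one or two constraint checks instead of one per vocabulary item.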
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Learning-Order Autoregressive Models with Application to Molecular Graph Generation [52.44913282062524]
We introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data.
We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation.
arXiv Detail & Related papers (2025-03-07T23:24:24Z) - Predicting Through Generation: Why Generation Is Better for Prediction [10.098410272203301]
This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. We introduce PredGen, an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
arXiv Detail & Related papers (2025-02-25T03:48:19Z) - Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling. Next-token prediction can be made robust so as to achieve $C = \tilde{O}(H)$, representing moderate error amplification. No computationally efficient algorithm can achieve sub-polynomial approximation factor $C = e^{(\log H)^{1-\Omega(1)}}$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
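The "iteratively-refined parallel decoding" this entry mentions can be sketched in the spirit of mask-predict decoding: fill every position in parallel, then re-mask and redo the least confident ones. The toy predictor below ignores context entirely; it is a hypothetical stand-in for a real GMLM's conditionals, not the paper's T5 adaptation.

```python
import numpy as np

rng = np.random.default_rng(2)
MASK = -1

def iterative_refine(seq_len, predict_fn, steps=3):
    """Mask-predict style parallel decoding: start fully masked, fill
    all positions each step, then re-mask the least confident ones."""
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        probs = predict_fn(tokens)      # (seq_len, vocab) conditionals
        tokens = probs.argmax(axis=1)   # commit to the best token everywhere
        conf = probs.max(axis=1)
        # Re-mask a shrinking fraction of low-confidence positions.
        n_mask = int(seq_len * (1 - (step + 1) / steps))
        if n_mask:
            tokens[np.argsort(conf)[:n_mask]] = MASK
    return tokens

# Toy predictor: fixed conditionals over a 5-token vocabulary,
# independent of the partially decoded sequence.
table = rng.random((8, 5))
table /= table.sum(axis=1, keepdims=True)
out = iterative_refine(8, lambda toks: table)
```

Because every position is predicted in one forward pass per step, a few refinement steps replace a token-by-token autoregressive loop, which is where the quoted 2-3x speedup comes from.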
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models [40.867655189493924]
The open-ended nature of language generation makes evaluating large language models (LLMs) challenging.
One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space.
We evaluate how aligned first-token evaluation is with the text output along several dimensions.
arXiv Detail & Related papers (2024-02-22T12:47:33Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs).
Standard conformal prediction produces prediction sets with rigorous, statistical guarantees.
We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
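For generation tasks, a conformal prediction set is a set of sampled outputs rather than a set of class labels. A minimal sketch of the calibration step, with toy scores standing in for real generation confidences and with the finite-sample quantile correction omitted for brevity, might look like:

```python
import numpy as np

rng = np.random.default_rng(3)

# Calibration data: for each prompt, the model score of its best
# *correct* sampled candidate (toy Beta-distributed stand-ins).
cal_best_correct = rng.beta(5, 2, size=300)

# Pick the largest threshold such that, on roughly 1 - alpha of
# prompts, at least one correct candidate still clears it. This is
# the empirical alpha-quantile; a rigorous guarantee needs the
# standard (n + 1)-adjusted quantile.
alpha = 0.1
tau = np.quantile(cal_best_correct, alpha)

def prediction_set(candidates, scores, tau=tau):
    """Keep every sampled generation whose score clears the threshold."""
    return [c for c, s in zip(candidates, scores) if s >= tau]
```

At test time one keeps sampling candidates and retains those above `tau`; the calibrated threshold is what converts a heuristic model score into a set with a coverage guarantee.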
arXiv Detail & Related papers (2023-06-16T21:55:08Z) - Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models [8.166629393064097]
The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for deep learning models.
Recent work shows that this limitation persists in state-of-the-art Transformer-based models.
We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure.
arXiv Detail & Related papers (2022-08-24T11:25:27Z) - Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
Most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z) - Bayes DistNet -- A Robust Neural Network for Algorithm Runtime Distribution Predictions [1.8275108630751844]
Randomized algorithms are used in many state-of-the-art solvers for constraint satisfaction problems (CSP) and Boolean satisfiability (SAT) problems.
Previous state-of-the-art methods directly try to predict a fixed parametric distribution that the input instance follows.
This new model achieves robust predictive performance in the low observation setting, as well as handling censored observations.
arXiv Detail & Related papers (2020-12-14T01:15:39Z) - Information-Theoretic Probing with Minimum Description Length [74.29846942213445]
We propose an alternative to the standard probes: information-theoretic probing with minimum description length (MDL).
With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data.
We show that these methods agree in results and are more informative and stable than the standard probes.
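The MDL framing of "training a probe as transmitting the data" is commonly computed prequentially: encode each new block of labels with a probe trained only on the data seen so far, and sum the code lengths. The sketch below uses toy representations and a hypothetical split schedule; it illustrates the online codelength idea, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(4)

# Toy representations and labels for a probing task.
X = rng.normal(size=(500, 16))
y = (X[:, 0] > 0).astype(int)

def online_codelength(X, y, splits=(25, 50, 100, 200, 500)):
    """Prequential (online) MDL: encode each new block with a probe
    trained only on earlier data; return total code length in bits."""
    total_bits = float(splits[0])  # first block: 1 bit/label, uniform code
    for lo, hi in zip(splits[:-1], splits[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:lo], y[:lo])
        p = probe.predict_proba(X[lo:hi])
        # Cross-entropy in nats, summed over the block, converted to bits.
        total_bits += log_loss(y[lo:hi], p, normalize=False,
                               labels=[0, 1]) / np.log(2)
    return total_bits

bits = online_codelength(X, y)
```

A short codelength means the labels are easy to compress given the representations, which is the sense in which MDL probing measures how accessible the information is, not just whether a probe can fit it.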
arXiv Detail & Related papers (2020-03-27T09:35:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.