Related papers: Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

URL: http://arxiv.org/abs/2507.12428v2
Date: Tue, 07 Oct 2025 16:30:40 GMT
Title: Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Authors: Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach
Abstract summary: Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs). We evaluate a range of monitoring methods using either CoT text or activations. We find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe.
Score: 14.840508854268522
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs), but this process can also increase harmful outputs in adversarial settings. In this work, we ask whether the long CoTs can be leveraged for predictive safety monitoring: do the reasoning traces provide early signals of final response alignment that could enable timely intervention? We evaluate a range of monitoring methods using either CoT text or activations, including highly capable large language models, fine-tuned classifiers, and humans. First, we find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe, with an average absolute increase of 13 in F1 scores over the best-performing alternatives. CoT texts are often unfaithful and misleading, while model latents provide a more reliable predictive signal. Second, the probe can be applied to early CoT segments before the response is generated, showing that alignment signals appear before reasoning completes. Error analysis reveals that the performance gap between text classifiers and the linear probe largely stems from a subset of responses we call performative CoTs, where the reasoning consistently contradicts the final response as the CoT progresses. Our findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
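
Illustrative sketch (not the authors' code): the abstract describes a linear probe trained on CoT activations to predict whether the final response is safe or unsafe, scored by F1. The toy example below shows the general shape of such a probe as a logistic regression. Everything in it is an assumption made for illustration: the "activations" are synthetic random vectors rather than hidden states from a real model, and the dimension, learning rate, and function names are hypothetical. The paper's actual features, layers, and training setup are not given in this listing.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Stand-in for CoT activations: label 1 ("unsafe") is shifted along a fixed direction."""
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + sign * 0.8 * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(w, b, x):
    """Probe output: estimated probability that the final response is unsafe."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def f1_score(data, w, b):
    """F1 for the positive ("unsafe") class at a 0.5 threshold."""
    tp = fp = fn = 0
    for x, y in data:
        y_hat = 1 if score(w, b, x) >= 0.5 else 0
        if y_hat == 1 and y == 1:
            tp += 1
        elif y_hat == 1 and y == 0:
            fp += 1
        elif y_hat == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)  # fixed seed so the run is repeatable
DIM = 8  # hypothetical activation width; real hidden states are far larger
train_set = make_synthetic_activations(400, DIM, rng)
test_set = make_synthetic_activations(200, DIM, rng)

w, b = train_linear_probe(train_set, DIM)
probe_f1 = f1_score(test_set, w, b)
print(f"held-out F1 on synthetic data: {probe_f1:.2f}")
```

In a real monitoring setup the input vectors would be activations read from the model at some point in the CoT, and the same `score` call could in principle be applied to early CoT segments to flag a likely unsafe response before generation finishes. The number printed here reflects only the synthetic data and says nothing about the results reported in the paper.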

Related papers

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering [5.427346259545067]
Chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models. We show that instruction-tuned models often determine their answer before generating CoT.
arXiv Detail & Related papers (2026-03-02T04:33:55Z)
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens [12.788799173865]
We quantify inference-time effort by identifying deep-thinking tokens. Think@n is a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.
arXiv Detail & Related papers (2026-02-13T23:07:37Z)
PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations [52.67948063133533]
Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs. Existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces. We propose Promise, a novel framework that integrates dense, step-by-step verification into generative models.
arXiv Detail & Related papers (2026-01-08T07:38:46Z)
How Does Prefix Matter in Reasoning Model Tuning? [57.69882799751655]
We fine-tune three R1 series models across three core model capabilities: reasoning (mathematics), coding, safety, and factuality. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy.
arXiv Detail & Related papers (2026-01-04T18:04:23Z)
A Pragmatic Way to Measure Chain-of-Thought Monitorability [10.811252340660907]
Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility and coverage. We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs.
arXiv Detail & Related papers (2025-10-28T00:44:25Z)
The Coverage Principle: How Pre-Training Enables Post-Training [70.25788947586297]
We study how pre-training shapes the success of the final model. We uncover a mechanism that explains the power of coverage in predicting downstream performance.
arXiv Detail & Related papers (2025-10-16T17:53:50Z)
Large language models can learn and generalize steganographic chain-of-thought under process supervision [5.173324198381261]
Chain-of-thought (CoT) reasoning provides insights into decision-making processes. CoT monitoring can be used to reduce risks associated with deploying models. We show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings.
arXiv Detail & Related papers (2025-06-02T17:45:15Z)
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Language Models Can Predict Their Own Behavior [28.80639362933004]
We show that internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence. We leverage this capacity and learn probes on internal states to create early warning (and exit) systems. Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead.
arXiv Detail & Related papers (2025-02-18T23:13:16Z)
Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering [55.15192437680943]
Generative models lack rigorous statistical guarantees for their outputs. We propose a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee. This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example.
arXiv Detail & Related papers (2024-10-02T15:26:52Z)
CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans. Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step. We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
Markovian Transformers for Informative Language Modeling [0.9642500063568188]
Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model's underlying decision process. We make CoT causally essential in a "Markovian" language model, factoring next-token prediction through an intermediate CoT and training it to predict future tokens independently of the original prompt.
arXiv Detail & Related papers (2024-04-29T17:36:58Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
Language Models Explain Word Reading Times Better Than Empirical Predictability [20.38397241720963]
The traditional approach in cognitive reading research assumes that word predictability from sentence context is best captured by cloze completion probability. Probability language models provide deeper explanations for syntactic and semantic effects than CCP. N-gram and RNN probabilities of the present word more consistently predicted reading performance compared with topic models or CCP.
arXiv Detail & Related papers (2022-02-02T16:38:43Z)
Complex Event Forecasting with Prediction Suffix Trees: Extended Technical Report [70.7321040534471]
Complex Event Recognition (CER) systems have become popular in the past two decades due to their ability to "instantly" detect patterns on real-time streams of events. There is a lack of methods for forecasting when a pattern might occur before such an occurrence is actually detected by a CER engine. We present a formal framework that attempts to address the issue of Complex Event Forecasting.
arXiv Detail & Related papers (2021-09-01T09:52:31Z)
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. We introduce a new scoring method that casts a plausibility ranking task in a full-text format. We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.