Depth-Wise Activation Steering for Honest Language Models
- URL: http://arxiv.org/abs/2512.07667v1
- Date: Mon, 08 Dec 2025 16:03:06 GMT
- Title: Depth-Wise Activation Steering for Honest Language Models
- Authors: Gracjan Góral, Marysia Winkels, Steven Basart,
- Abstract summary: We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. We evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines.
- Score: 4.299414551764217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models sometimes assert falsehoods despite internally representing the correct answer; these are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how the intervention is distributed across depth materially affects outcomes beyond its total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
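To make the idea concrete, the following is a minimal PyTorch sketch of a Gaussian depth schedule. It assumes a HuggingFace-style decoder exposed as `model.model.layers`, a single precomputed honesty steering vector, and hyperparameters `mu` (center layer), `sigma` (width), and `alpha` (peak strength), none of which are specified in the abstract; it illustrates the mechanism rather than reproducing the authors' implementation.

```python
# Sketch of Gaussian depth-weighted activation steering (assumed details:
# hook-based injection, one shared steering vector, peak-normalized weights).
import torch

def gaussian_depth_weights(num_layers, mu, sigma, alpha):
    """Per-layer steering strengths following a Gaussian over depth."""
    depth = torch.arange(num_layers, dtype=torch.float32)
    w = torch.exp(-0.5 * ((depth - mu) / sigma) ** 2)
    return alpha * w / w.max()  # the peak layer receives strength alpha

def add_steering_hooks(model, steering_vec, weights):
    """Add w_l * v to the residual-stream output of each decoder layer."""
    handles = []
    for layer, w in zip(model.model.layers, weights.tolist()):
        def hook(module, args, output, w=w):
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + w * steering_vec.to(hidden.device, hidden.dtype)
            return (steered,) + output[1:] if isinstance(output, tuple) else steered
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to disable steering
```

In this parameterization, a single-layer edit is the small-`sigma` limit and a uniform allocation is the large-`sigma` limit, so the depth-allocation ablations can be read as comparing shapes of one simple schedule under a fixed budget.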
Related papers
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We also introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z)
- CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning [4.765206163164323]
CLEANER exploits intrinsic self-correction capabilities to eliminate error-contaminated context during data collection. A Similarity-Aware Adaptive Rollback mechanism autonomously constructs clean, purified trajectories. Results show average accuracy gains of 6%, 3%, and 5% over baselines.
arXiv Detail & Related papers (2026-01-21T16:14:30Z)
- Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling [90.87033586963828]
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). We propose Self-Consistency Sampling (SCS) to correct this issue. Based on Qwen2.5-VL-7B-Instruct, SCS improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation.
arXiv Detail & Related papers (2025-11-13T18:59:57Z)
- Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines.
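EVPI itself is a standard decision-theoretic quantity: the expected utility gain from acting after uncertainty is resolved versus acting on the current belief. A toy Python sketch for an uncertain tool argument follows; the `belief` and `utility` names are illustrative, not SAGE-Agent's API.

```python
# Toy EVPI computation for deciding whether to ask a clarification question.
def evpi(belief, utility, actions):
    """belief: {value: prob}; utility(action, value) -> float.
    EVPI = E[max_a U(a, v)] - max_a E[U(a, v)]."""
    expected_max = sum(p * max(utility(a, v) for a in actions) for v, p in belief.items())
    max_expected = max(sum(p * utility(a, v) for v, p in belief.items()) for a in actions)
    return expected_max - max_expected

# An agent unsure which file the user meant; utility is 1 for the right call.
belief = {"/data/a.csv": 0.6, "/data/b.csv": 0.4}
actions = list(belief)
u = lambda a, v: 1.0 if a == v else 0.0
print(evpi(belief, u, actions))  # 0.4 -> asking resolves 0.4 expected utility
```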
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
- Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations. How to check their outputs effectively and efficiently has become a critical problem in their applications.
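The summary states the problem rather than the method, but the title names a concrete quantity. Purely as a generic illustration, not the paper's procedure, the rank of a correlation matrix over hypothetical embeddings of reasoning steps can be computed like this:

```python
# Generic numerical-rank computation; the embedding source, tolerance, and
# decision rule are all assumptions, not taken from the paper.
import numpy as np

def correlation_matrix_rank(step_embeddings, tol=1e-6):
    """step_embeddings: (num_steps, dim) array, one row per reasoning step."""
    corr = np.corrcoef(step_embeddings)            # (num_steps, num_steps)
    return int(np.linalg.matrix_rank(corr, tol=tol))
```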
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
- AWM: Accurate Weight-Matrix Fingerprint for Large Language Models [44.93519442566325]
We propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations. Our method demonstrates exceptional robustness against six categories of post-training manipulation while exhibiting a near-zero risk of false positives.
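Both ingredients named here are standard tools, combined below in a simplified sketch: SciPy's linear-sum-assignment solver handles the LAP, and the common (biased) linear CKA stands in for the paper's unbiased estimator; the alignment cost is also an assumption.

```python
# Simplified fingerprint comparison: undo a neuron permutation with the LAP,
# then score similarity of two weight matrices with linear CKA.
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_cka(X, Y):
    """Standard (biased) linear CKA between matrices of shape (samples, features)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def aligned_similarity(W_a, W_b):
    """Permute rows (neurons) of W_b to best match W_a, then compare via CKA."""
    cost = -W_a @ W_b.T                        # maximize row-wise dot products
    _, cols = linear_sum_assignment(cost)      # optimal permutation (the LAP)
    return linear_cka(W_a.T, W_b[cols].T)      # treat neurons as features
```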
arXiv Detail & Related papers (2025-10-08T07:51:11Z)
- Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
Self-correction is one approach to improving math reasoning. Existing self-correction approaches treat corrections as standalone post-generation refinements. We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
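The title suggests the core move: compute the training loss only on the first few generated tokens. Below is a hedged sketch of such a prefix-restricted loss, with the prefix length `k`, the batch layout, and the self-sampling loop all assumed rather than taken from the abstract.

```python
# Prefix-only language-modeling loss: supervise just the first k tokens that
# follow the prompt in a self-generated solution (illustrative, not UPFT's code).
import torch
import torch.nn.functional as F

def prefix_only_loss(logits, input_ids, prompt_len, k):
    """logits: (B, T, V); input_ids: (B, T); loss on generated tokens [0, k)."""
    shift_logits = logits[:, :-1, :]               # position i predicts token i+1
    labels = input_ids[:, 1:].clone()
    keep = torch.zeros_like(labels, dtype=torch.bool)
    keep[:, prompt_len - 1 : prompt_len - 1 + k] = True  # first k generated tokens
    labels[~keep] = -100                           # ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```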
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z)
- Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach [46.76317056976196]
Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks.
We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data.
We develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision.
arXiv Detail & Related papers (2020-10-15T15:55:08Z)