Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
- URL: http://arxiv.org/abs/2601.06403v1
- Date: Sat, 10 Jan 2026 02:56:38 GMT
- Title: Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
- Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Nigel Collier
- Abstract summary: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control.
- Score: 33.569783099301695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
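The abstract describes contrasting logits obtained under the target and default system prompts and amplifying their difference by a scalar alpha. A minimal sketch of that arithmetic, assuming the intervention takes the standard contrastive-extrapolation form `logits_default + alpha * (logits_target - logits_default)` (my reading of the abstract, not the paper's exact formulation; the toy logit values are invented for illustration):

```python
import numpy as np

def steer_logits(logits_target, logits_default, alpha):
    """Amplify the behavioral signal unique to the target system prompt.

    alpha = 0 recovers the default-prompt distribution, alpha = 1 the
    target-prompt distribution, and alpha > 1 extrapolates beyond it.
    """
    logits_target = np.asarray(logits_target, dtype=float)
    logits_default = np.asarray(logits_default, dtype=float)
    return logits_default + alpha * (logits_target - logits_default)

# Toy next-token logits over a 4-token vocabulary (illustrative values only).
logits_default = np.array([2.0, 1.0, 0.5, 0.1])  # under the default assistant prompt
logits_target  = np.array([0.5, 1.2, 2.5, 0.1])  # under the target persona prompt

steered = steer_logits(logits_target, logits_default, alpha=2.0)
probs = np.exp(steered - steered.max())  # numerically stable softmax
probs /= probs.sum()
print(steered)        # → [-1.   1.4  4.5  0.1]
print(probs.argmax()) # → 2: the token the persona favors is now sharply preferred
```

With alpha above 1 the persona-specific token gains probability mass beyond what the target prompt alone produces, which is what lets a scalar dial act as a continuous "strength" control at inference time.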
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs [6.715533531385597]
Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods. We propose a contrastive Sparse AutoEncoder framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model.
arXiv Detail & Related papers (2026-02-22T12:39:02Z) - Steering Language Models Before They Speak: Logit-Level Interventions [9.055997973281919]
We propose a training-free inference-time logit intervention for controllable generation. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains.
arXiv Detail & Related papers (2026-01-16T03:00:33Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - End-to-End Visual Autonomous Parking via Control-Aided Attention [30.52881549605385]
CAA-Policy is an end-to-end imitation learning system for precise parking. It allows the control signal to guide the learning of visual attention via a novel Control-Aided Attention mechanism.
arXiv Detail & Related papers (2025-09-14T04:51:19Z) - Instruction Following by Boosting Attention of Large Language Models [11.739148611340964]
Latent steering is a lightweight technique that alters internal activations to guide generation. InstABoost boosts the strength of instruction prompting by altering the model's attention during generation, and demonstrates superior control success compared to both traditional prompting and latent steering.
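The InstABoost summary above suggests a simple mechanism: increase the attention the model pays to instruction tokens during generation. A minimal sketch, assuming the intervention amounts to adding a positive bias to pre-softmax attention scores at instruction-token positions (my simplified reading, not necessarily the paper's exact formulation; the score values are invented):

```python
import numpy as np

def boost_attention(scores, instruction_mask, boost=2.0):
    """Bias pre-softmax attention scores toward instruction tokens,
    then renormalize with a softmax."""
    biased = np.asarray(scores, dtype=float) + boost * np.asarray(instruction_mask, dtype=float)
    weights = np.exp(biased - biased.max())  # numerically stable softmax
    return weights / weights.sum()

# One query attending over 5 keys; tokens 1 and 2 belong to the instruction.
scores = np.array([1.0, 0.5, 0.5, 1.0, 1.5])
mask   = np.array([0,   1,   1,   0,   0])

plain = np.exp(scores - scores.max())
plain /= plain.sum()
boosted = boost_attention(scores, mask)

# The attention mass on instruction tokens grows after boosting.
print(boosted[1] + boosted[2] > plain[1] + plain[2])  # → True
```

Because the bias is applied before the softmax, the intervention reallocates a fixed attention budget toward the instruction rather than changing the model's weights, which is what makes it a test-time control.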
arXiv Detail & Related papers (2025-06-16T17:42:35Z) - Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm - prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
arXiv Detail & Related papers (2025-05-27T17:16:00Z) - SMART: Self-supervised Multi-task pretrAining with contRol Transformers [34.604339091596884]
Self-supervised pretraining has been extensively studied in language and vision domains.
It is difficult to properly design such a pretraining approach for sequential decision-making tasks.
We propose a generic pretraining framework for sequential decision making.
arXiv Detail & Related papers (2023-01-24T05:01:23Z) - When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z) - Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly use an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z) - Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.