Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
- URL: http://arxiv.org/abs/2601.06403v1
- Date: Sat, 10 Jan 2026 02:56:38 GMT
- Title: Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding
- Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Nigel Collier
- Abstract summary: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control.
- Score: 33.569783099301695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
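The abstract describes contrasting logits obtained under the target and default system prompts and amplifying their difference by a scalar alpha. A minimal sketch of that arithmetic, assuming the intervention takes the standard contrastive-extrapolation form `logits_default + alpha * (logits_target - logits_default)` (my reading of the abstract, not the paper's exact formulation; the toy logit values are invented for illustration):

```python
import numpy as np

def steer_logits(logits_target, logits_default, alpha):
    """Amplify the behavioral signal unique to the target system prompt.

    alpha = 0 recovers the default-prompt distribution, alpha = 1 the
    target-prompt distribution, and alpha > 1 extrapolates beyond it.
    """
    logits_target = np.asarray(logits_target, dtype=float)
    logits_default = np.asarray(logits_default, dtype=float)
    return logits_default + alpha * (logits_target - logits_default)

# Toy next-token logits over a 4-token vocabulary (illustrative values only).
logits_default = np.array([2.0, 1.0, 0.5, 0.1])  # under the default assistant prompt
logits_target  = np.array([0.5, 1.2, 2.5, 0.1])  # under the target persona prompt

steered = steer_logits(logits_target, logits_default, alpha=2.0)
probs = np.exp(steered - steered.max())  # numerically stable softmax
probs /= probs.sum()
print(steered)        # → [-1.   1.4  4.5  0.1]
print(probs.argmax()) # → 2: the token the persona favors is now sharply preferred
```

With alpha above 1 the persona-specific token gains probability mass beyond what the target prompt alone produces, which is what lets a scalar dial act as a continuous "strength" control at inference time.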
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs [6.715533531385597]
Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods. We propose a contrastive Sparse AutoEncoder framework that learns facet-level personality control vectors aligned with the Big Five 30-facet model.
arXiv Detail & Related papers (2026-02-22T12:39:02Z) - Steering Language Models Before They Speak: Logit-Level Interventions [9.055997973281919]
We propose a training-free inference-time logit intervention for controllable generation. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains.
arXiv Detail & Related papers (2026-01-16T03:00:33Z) - AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z) - End-to-End Visual Autonomous Parking via Control-Aided Attention [30.52881549605385]
CAA-Policy is an end-to-end imitation learning system for precise parking. It allows the control signal to guide the learning of visual attention via a novel Control-Aided Attention mechanism.
arXiv Detail & Related papers (2025-09-14T04:51:19Z) - Instruction Following by Boosting Attention of Large Language Models [11.739148611340964]
Latent steering is a lightweight technique that alters internal activations to guide generation. InstABoost boosts the strength of instruction prompting by altering the model's attention during generation, and demonstrates superior control success compared to both traditional prompting and latent steering.
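The InstABoost summary above suggests a simple mechanism: increase the attention the model pays to instruction tokens during generation. A minimal sketch, assuming the intervention amounts to adding a positive bias to pre-softmax attention scores at instruction-token positions (my simplified reading, not necessarily the paper's exact formulation; the score values are invented):

```python
import numpy as np

def boost_attention(scores, instruction_mask, boost=2.0):
    """Bias pre-softmax attention scores toward instruction tokens,
    then renormalize with a softmax."""
    biased = np.asarray(scores, dtype=float) + boost * np.asarray(instruction_mask, dtype=float)
    weights = np.exp(biased - biased.max())  # numerically stable softmax
    return weights / weights.sum()

# One query attending over 5 keys; tokens 1 and 2 belong to the instruction.
scores = np.array([1.0, 0.5, 0.5, 1.0, 1.5])
mask   = np.array([0,   1,   1,   0,   0])

plain = np.exp(scores - scores.max())
plain /= plain.sum()
boosted = boost_attention(scores, mask)

# The attention mass on instruction tokens grows after boosting.
print(boosted[1] + boosted[2] > plain[1] + plain[2])  # → True
```

Because the bias is applied before the softmax, the intervention reallocates a fixed attention budget toward the instruction rather than changing the model's weights, which is what makes it a test-time control.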
arXiv Detail & Related papers (2025-06-16T17:42:35Z) - Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm - prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
arXiv Detail & Related papers (2025-05-27T17:16:00Z) - SMART: Self-supervised Multi-task pretrAining with contRol Transformers [34.604339091596884]
Self-supervised pretraining has been extensively studied in language and vision domains.
It is difficult to properly design such a pretraining approach for sequential decision-making tasks.
We propose a generic pretraining framework for sequential decision making.
arXiv Detail & Related papers (2023-01-24T05:01:23Z) - When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z) - Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly use an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z) - Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.