Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
- URL: http://arxiv.org/abs/2505.20322v2
- Date: Tue, 03 Jun 2025 13:40:17 GMT
- Title: Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
- Authors: Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang
- Abstract summary: The vast number of parameters in models often results in highly intertwined internal representations. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. We propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety.
- Score: 71.85633762642125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.
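The abstract does not spell out the mechanics, but the general recipe behind SAE-based activation steering can be sketched in a few lines: encode a hidden activation with a pre-trained sparse autoencoder, nudge a chosen set of latent "atoms", and fold the resulting change back into the activation. The dimensions, weights, atom indices, and steering strength below are illustrative stand-ins, not the authors' STA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512                     # illustrative sizes, not from the paper

# Toy stand-ins for a pre-trained sparse autoencoder's parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)

def sae_encode(h):
    """Map a hidden state to sparse, non-negative latent activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z):
    """Reconstruct a hidden state from latent activations."""
    return z @ W_dec + b_dec

def steer(h, target_atoms, alpha=5.0):
    """Amplify selected latent 'atoms' and fold the edit back into the hidden state.

    target_atoms: indices of latent dimensions assumed to encode the target behavior.
    alpha: steering strength; positive amplifies the behavior, negative suppresses it.
    """
    z = sae_encode(h)
    z_steered = z.copy()
    z_steered[target_atoms] += alpha
    # Only the decoded difference is applied, so untouched atoms leave h unchanged.
    return h + (sae_decode(z_steered) - sae_decode(z))

h = rng.normal(size=d_model)                 # a hidden activation from some layer
h_steered = steer(h, target_atoms=[3, 17, 42])
print(np.linalg.norm(h_steered - h))         # a small, targeted edit to the activation
```

In this sketch the edit is low-rank by construction: it changes the activation only along the decoder rows of the chosen atoms, which is the property that makes this style of steering more targeted than prompt-level interventions.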
Related papers
- Control-ITRA: Controlling the Behavior of a Driving Model [14.31198056147624]
We introduce a method called Control-ITRA to influence agent behavior through waypoint assignment and target speed modulation. We demonstrate that our method can generate controllable, infraction-free trajectories while preserving realism in both seen and unseen locations.
arXiv Detail & Related papers (2025-01-17T03:35:11Z)
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) aligned via Reinforcement Learning from Human Feedback (RLHF) have become a critical area of research.
Representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of an LLM's intermediate hidden states.
However, it is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature; a minimal sketch of the underlying steering idea follows this entry.
arXiv Detail & Related papers (2024-11-04T08:36:03Z)
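A minimal sketch of the representation-engineering idea this entry refers to: estimate a semantic direction (e.g., honesty) as the difference of mean hidden states over contrastive prompts, then add it to an intermediate hidden state at inference time. The arrays, hidden size, and scale below are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64                                   # illustrative hidden size

# Hidden states (hypothetically) collected at one layer from contrastive prompts,
# e.g. honest vs. dishonest completions of the same questions.
h_positive = rng.normal(loc=+0.2, size=(100, d_model))
h_negative = rng.normal(loc=-0.2, size=(100, d_model))

# The semantic feature is the difference of means, normalised to unit length.
direction = h_positive.mean(axis=0) - h_negative.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state, beta=4.0):
    """Shift an intermediate hidden state along the semantic direction."""
    return hidden_state + beta * direction

h = rng.normal(size=d_model)
print(float((steer(h) - h) @ direction))       # the state moves by beta along the feature
```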
- Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations.
We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms; a toy classifier sketch follows this entry.
arXiv Detail & Related papers (2024-03-09T21:07:16Z)
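In spirit, such detectors are compact classifiers that map generated text to harm labels. A toy, hedged illustration with scikit-learn follows; the miniature dataset and binary label scheme are invented for the example and are not part of the paper's detector library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy examples; a real detector is trained on curated harm-annotated data.
texts = [
    "Here is how to build a safe campfire.",
    "You should hurt that person.",
    "The capital of France is Paris.",
    "I will steal their password and drain the account.",
]
labels = [0, 1, 0, 1]                          # 0 = benign, 1 = harmful

# A compact, easy-to-build classifier: bag-of-words features + logistic regression.
detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(texts, labels)

print(detector.predict(["Please explain photosynthesis."]))  # predicted harm label
```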
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Learning Exactly Linearizable Deep Dynamics Models [0.06554326244334867]
We propose a learning method for exactly linearizable dynamical models to which various control theories can be readily applied to ensure stability, reliability, etc. The proposed model is employed for the real-time control of an automotive engine, and the results demonstrate good predictive performance and stable control under constraints.
arXiv Detail & Related papers (2023-11-30T05:40:55Z)
- Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems.
LLMs serve as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning.
We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z)
- Constrained Reinforcement Learning for Robotics via Scenario-Based Programming [64.07167316957533]
It is crucial to optimize the performance of DRL-based agents while providing guarantees about their behavior.
This paper presents a novel technique for incorporating domain-expert knowledge into a constrained DRL training loop.
Our experiments demonstrate that using our approach to leverage expert knowledge dramatically improves the safety and the performance of the agent.
arXiv Detail & Related papers (2022-06-20T07:19:38Z)
- Scalable Synthesis of Verified Controllers in Deep Reinforcement Learning [0.0]
We propose an automated verification pipeline capable of synthesizing high-quality safety shields.
Our key insight involves separating safety verification from the neural controller, using pre-computed verified safety shields to constrain neural controller training; a minimal shield sketch follows this entry.
Experimental results over a range of realistic high-dimensional deep RL benchmarks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-04-20T19:30:29Z)
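A hedged sketch of the shield idea above: an offline verifier pre-computes which actions are provably safe in each state, and the learned controller's choice is masked to that set both during training and at deployment. The random table of allowed actions below is a stand-in for a real verified shield.

```python
import numpy as np

rng = np.random.default_rng(2)
N_STATES, N_ACTIONS = 10, 4

# allowed[s, a] == True means action a was verified safe in state s.
# Random here for illustration; in practice it comes from offline verification.
allowed = rng.random((N_STATES, N_ACTIONS)) > 0.3
allowed[:, 0] = True                            # guarantee a safe fallback action everywhere

def shielded_action(state, q_values):
    """Pick the highest-value action among those the verified shield allows."""
    masked = np.where(allowed[state], q_values, -np.inf)
    return int(np.argmax(masked))

q = rng.normal(size=(N_STATES, N_ACTIONS))      # stand-in for a learned controller's values
print(shielded_action(3, q[3]))                 # the executed action is always shield-approved
```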
- Runtime Safety Assurance Using Reinforcement Learning [37.61747231296097]
This paper aims to design a meta-controller capable of identifying unsafe situations with high accuracy.
We frame the design of the RTSA system as a Markov decision process (MDP) and use reinforcement learning (RL) to solve it.
arXiv Detail & Related papers (2020-10-20T20:54:46Z)
- Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification [81.32981236437395]
We present a semi-formal verification approach for decision-making tasks based on interval analysis.
Our method obtains results comparable to formal verifiers over standard benchmarks.
Our approach allows safety properties of decision-making models to be evaluated efficiently in practical applications; a minimal interval-propagation sketch follows this entry.
arXiv Detail & Related papers (2020-10-19T11:18:06Z)
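A minimal sketch of the interval-analysis idea behind this kind of semi-formal verification: propagate lower and upper bounds on an input box through affine and ReLU layers to obtain guaranteed output bounds, then check a safety property against those bounds. The two-layer network and the property below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def interval_affine(lo, hi, W, b):
    """Bound W @ x + b for x in [lo, hi] by splitting W into positive and negative parts."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    """ReLU is monotone, so it maps bounds to bounds directly."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# A tiny stand-in policy network: 4 state inputs -> 8 hidden units -> 2 action scores.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

# Input region to verify, e.g. a small box of states around an operating point.
x_lo, x_hi = np.full(4, -0.1), np.full(4, 0.1)

lo, hi = interval_affine(x_lo, x_hi, W1, b1)
lo, hi = interval_relu(lo, hi)
lo, hi = interval_affine(lo, hi, W2, b2)

# Example safety property: the "unsafe" action 1 can never outscore action 0 on this region.
print("property holds:", bool(hi[1] < lo[0]))
```

The bounds computed this way are sound but generally not tight, trading some precision for efficiency relative to exact formal verification.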