Extending Activation Steering to Broad Skills and Multiple Behaviours
- URL: http://arxiv.org/abs/2403.05767v1
- Date: Sat, 9 Mar 2024 02:30:04 GMT
- Title: Extending Activation Steering to Broad Skills and Multiple Behaviours
- Authors: Teun van der Weij, Massimo Poesio, Nandi Schoots
- Abstract summary: We investigate the efficacy of activation steering for broad skills and multiple behaviours.
We find that steering broader skills is competitive to steering narrower skills.
We steer models to become more or less myopic and wealth-seeking.
- Score: 5.40770929004319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current large language models have dangerous capabilities, which are likely
to become more problematic in the future. Activation steering techniques can be
used to reduce risks from these capabilities. In this paper, we investigate the
efficacy of activation steering for broad skills and multiple behaviours.
First, by comparing the effects of reducing performance on general coding
ability and Python-specific ability, we find that steering broader skills is
competitive to steering narrower skills. Second, we steer models to become more
or less myopic and wealth-seeking, among other behaviours. In our experiments,
combining steering vectors for multiple different behaviours into one steering
vector is largely unsuccessful. On the other hand, injecting individual
steering vectors at different places in a model simultaneously is promising.
Related papers
- Learning Dexterous Bimanual Catch Skills through Adversarial-Cooperative Heterogeneous-Agent Reinforcement Learning [2.12918068324152]
We propose a novel framework for learning dexterous bimanual catching skills using Heterogeneous-Agent Reinforcement Learning (HARL)
Our approach introduces an adversarial reward scheme, where a throw agent increases the difficulty of throws-adjusting speed-while a catch agent learns to coordinate both hands to catch objects.
We evaluate the framework in simulated environments using 15 different objects, demonstrating robustness and versatility in handling diverse objects.
arXiv Detail & Related papers (2025-02-17T04:50:45Z) - DexterityGen: Foundation Controller for Unprecedented Dexterity [67.15251368211361]
Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge.
Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning.
We introduce DexterityGen, which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation.
In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior.
arXiv Detail & Related papers (2025-02-06T18:49:35Z) - Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering [0.0]
This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at inference time.
We introduce conceptors - mathematical constructs that represent sets of activation vectors as ellipsoidal regions.
Our experiments demonstrate that conceptors outperform traditional methods across multiple steering tasks.
arXiv Detail & Related papers (2024-10-09T10:09:37Z) - Analyzing the Generalization and Reliability of Steering Vectors [8.253773195379166]
We show that steering vectors have substantial limitations both in- and out-of-distribution.
In-distribution, steerability is highly variable across different inputs.
Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt.
arXiv Detail & Related papers (2024-07-17T08:32:03Z) - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z) - Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control [106.32794844077534]
This paper presents a study on using deep reinforcement learning to create dynamic locomotion controllers for bipedal robots.
We develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing.
This work pushes the limits of agility for bipedal robots through extensive real-world experiments.
arXiv Detail & Related papers (2024-01-30T10:48:43Z) - Improving Activation Steering in Language Models with Mean-Centring [10.101141087916133]
We find that taking the average of activations associated with a target dataset, and subtracting the mean of all training activations, results in effective steering vectors.
We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin.
arXiv Detail & Related papers (2023-12-06T18:27:07Z) - Learning and Adapting Agile Locomotion Skills by Transferring Experience [71.8926510772552]
We propose a framework for training complex robotic skills by transferring experience from existing controllers to jumpstart learning new tasks.
We show that our method enables learning complex agile jumping behaviors, navigating to goal locations while walking on hind legs, and adapting to new environments.
arXiv Detail & Related papers (2023-04-19T17:37:54Z) - Learning energy-efficient driving behaviors by imitating experts [75.12960180185105]
This paper examines the role of imitation learning in bridging the gap between control strategies and realistic limitations in communication and sensing.
We show that imitation learning can succeed in deriving policies that, if adopted by 5% of vehicles, may boost the energy-efficiency of networks with varying traffic conditions by 15% using only local observations.
arXiv Detail & Related papers (2022-06-28T17:08:31Z) - ASE: Large-Scale Reusable Adversarial Skill Embeddings for Physically
Simulated Characters [123.88692739360457]
General-purpose motor skills enable humans to perform complex tasks.
These skills also provide powerful priors for guiding their behaviors when learning new tasks.
We present a framework for learning versatile and reusable skill embeddings for physically simulated characters.
arXiv Detail & Related papers (2022-05-04T06:13:28Z) - Learning Agile Robotic Locomotion Skills by Imitating Animals [72.36395376558984]
Reproducing the diverse and agile locomotion skills of animals has been a longstanding challenge in robotics.
We present an imitation learning system that enables legged robots to learn agile locomotion skills by imitating real-world animals.
arXiv Detail & Related papers (2020-04-02T02:56:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.