Related papers: Patterns and Mechanisms of Contrastive Activation Engineering

Patterns and Mechanisms of Contrastive Activation Engineering

URL: http://arxiv.org/abs/2505.03189v1
Date: Tue, 06 May 2025 05:15:12 GMT
Title: Patterns and Mechanisms of Contrastive Activation Engineering
Authors: Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali,
Abstract summary: CAE has the potential to introduce a new paradigm of flexible, task-specific behavior tuning.<n>We analyze the performance of CAE in in in-distribution, out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment.
Score: 0.374490703387131
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference-time with zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution, out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that 1. CAE is only reliably effective when applied to in-distribution contexts. 2. Increasing the number of samples used to generate steering vectors has diminishing returns at around 80 samples. 3. Steering vectors are susceptible to adversarial inputs that reverses the behavior that is steered for. 4. Steering vectors harm the overall model perplexity. 5. Larger models are more resistant to steering-induced degradation.

Related papers

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs [15.23174472320989]
Large Language Models (LLMs) are central to many contemporary AI applications.<n>Recent works in eXplainable AI (XAI) suggest that interpretability can also enable model compression.
arXiv Detail & Related papers (2025-06-16T17:38:36Z)
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs [3.2361985831403404]
Activation steering provides an alternative for inference-time control.<n>We introduce a novel approach using a lightweight, trainable controller network integrated during inference.
arXiv Detail & Related papers (2025-05-22T01:48:38Z)
ExpertSteer: Intervening in LLMs through Expert Knowledge [71.12193680015622]
Activation steering offers a promising method to control the generation process of Large Language Models.<n>We propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors.<n>We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains.
arXiv Detail & Related papers (2025-05-18T08:55:46Z)
Improving Reasoning Performance in Large Language Models via Representation Engineering [2.0099933815960256]
We propose a representation engineering approach for large language models (LLMs)<n>Model activations are read from the residual stream of an LLM when processing a reasoning task.<n>We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations.
arXiv Detail & Related papers (2025-04-28T04:58:43Z)
R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.<n> Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Investigating Generalization of One-shot LLM Steering Vectors [21.2431937128876]
We propose optimizing steering vectors through gradient descent on a single training example.<n>We find that the resulting vectors effectively mediate safety-relevant behaviors in multiple models.
arXiv Detail & Related papers (2025-02-26T06:13:01Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Steering Llama 2 via Contrastive Activation Addition [41.54815073311959]
Contrastive Activation Addition (CAA) is a method for steering language models by modifying their activations during forward passes. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs)
arXiv Detail & Related papers (2023-12-09T04:40:46Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Automatic Rule Induction for Efficient Semi-Supervised Learning [56.91428251227253]
Semi-supervised learning has shown promise in allowing NLP models to generalize from small amounts of labeled data. Pretrained transformer models act as black-box correlation engines that are difficult to explain and sometimes behave unreliably. We propose tackling both of these challenges via Automatic Rule Induction (ARI), a simple and general-purpose framework.
arXiv Detail & Related papers (2022-05-18T16:50:20Z)
UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy. The proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.