COSMIC: Generalized Refusal Direction Identification in LLM Activations
- URL: http://arxiv.org/abs/2506.00085v1
- Date: Fri, 30 May 2025 04:54:18 GMT
- Title: COSMIC: Generalized Refusal Direction Identification in LLM Activations
- Authors: Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
- Abstract summary: We introduce COSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection. It identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals.
- Score: 43.30637889861949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) encode behaviors such as refusal within their activation space, yet identifying these behaviors remains a significant challenge. Existing methods often rely on predefined refusal templates detectable in output tokens or require manual analysis. We introduce COSMIC (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that identifies viable steering directions and target layers using cosine similarity - entirely independent of model outputs. COSMIC achieves steering performance comparable to prior methods without requiring assumptions about a model's refusal behavior, such as the presence of specific refusal tokens. It reliably identifies refusal directions in adversarial settings and weakly aligned models, and is capable of steering such models toward safer behavior with minimal increase in false refusals, demonstrating robustness across a wide range of alignment conditions.
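
To make the idea of scoring a steering direction by cosine similarity concrete, here is a minimal sketch on synthetic data. It is not the paper's algorithm: the abstract does not specify how candidates are generated or scored, so the difference-of-means candidate, the ablation step, the `score` function, and all names and toy activations below are assumptions chosen for illustration only.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_direction(harmful_acts, harmless_acts):
    """Assumed candidate: unit difference of the two mean activations."""
    return unit(harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0))

def ablate(acts, direction):
    """Remove each activation's component along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

def score(harmful_acts, harmless_acts):
    """Hypothetical score: how closely mean harmful activations resemble
    mean harmless activations, judged by cosine similarity alone."""
    return cosine_similarity(harmful_acts.mean(axis=0),
                             harmless_acts.mean(axis=0))

# Synthetic stand-in for one layer's activations (no real model involved).
rng = np.random.default_rng(0)
dim, n = 32, 200
planted = unit(rng.normal(size=dim))          # toy "refusal" direction
base = rng.normal(size=dim)
base -= np.dot(base, planted) * planted       # shared offset, orthogonal to planted
harmless = base + rng.normal(size=(n, dim))
harmful = base + rng.normal(size=(n, dim)) + 5.0 * planted

d = candidate_direction(harmful, harmless)
recovered = cosine_similarity(d, planted)     # alignment with the planted direction
steered = ablate(harmful, d)
residual = float(np.abs(steered @ d).max())   # leftover component along d
score_before = score(harmful, harmless)
score_after = score(steered, harmless)

print(f"cos(candidate, planted) = {recovered:.3f}")
print(f"score before ablation   = {score_before:.3f}")
print(f"score after ablation    = {score_after:.3f}")
```

Nothing in this sketch inspects generated text, which is the property the abstract emphasizes. In a real setting the same kind of score could be computed per layer and per candidate and the best-scoring pair kept; that selection loop is likewise an assumption here, not a description of COSMIC.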
Related papers
- Persona Features Control Emergent Misalignment [4.716981217776586]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment". We apply a "model diffing" approach to compare internal model representations before and after fine-tuning. We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z) - The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z) - Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts [11.81523319216474]
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties. Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept. We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations.
arXiv Detail & Related papers (2025-02-14T08:49:41Z) - Refusal in LLMs is an Affine Function [1.722461331472526]
We propose affine concept editing (ACE) as an approach for steering language models' behavior. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods.
arXiv Detail & Related papers (2024-11-13T20:12:55Z) - PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE enhances global feature representation of point cloud masked autoencoders by making them both discriminative and sensitive to transformations. We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations.
arXiv Detail & Related papers (2024-09-24T07:57:21Z) - LUCID-GAN: Conditional Generative Models to Locate Unfairness [1.5257247496416746]
We present LUCID-GAN, which generates canonical inputs via a conditional generative model instead of gradient-based inverse design.
We empirically evaluate LUCID-GAN on the UCI Adult and COMPAS data sets and show that it allows for detecting unethical biases in black-box models without requiring access to the training data.
arXiv Detail & Related papers (2023-07-28T10:37:49Z) - Learning non-Markovian Decision-Making from State-only Sequences [57.20193609153983]
We develop model-based imitation of state-only sequences with a non-Markov Decision Process (nMDP).
We demonstrate the efficacy of the proposed method in a path planning task with non-Markovian constraints.
arXiv Detail & Related papers (2023-06-27T02:26:01Z) - Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model.
A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations.
We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z) - Calibrating Over-Parametrized Simulation Models: A Framework via Eligibility Set [3.862247454265944]
We develop a framework for constructing calibration schemes that satisfy rigorous frequentist statistical guarantees.
We demonstrate our methodology on several numerical examples, including an application to calibration of a limit order book market simulator.
arXiv Detail & Related papers (2021-05-27T00:59:29Z) - Unsupervised Anomaly Detection with Adversarial Mirrored AutoEncoders [51.691585766702744]
We propose a variant of Adversarial Autoencoder which uses a mirrored Wasserstein loss in the discriminator to enforce better semantic-level reconstruction.
We put forward an alternative measure of anomaly score to replace the reconstruction-based metric.
Our method outperforms the current state-of-the-art methods for anomaly detection on several OOD detection benchmarks.
arXiv Detail & Related papers (2020-03-24T08:26:58Z)