Refusal Behavior in Large Language Models: A Nonlinear Perspective
- URL: http://arxiv.org/abs/2501.08145v1
- Date: Tue, 14 Jan 2025 14:23:18 GMT
- Title: Refusal Behavior in Large Language Models: A Nonlinear Perspective
- Authors: Fabian Hildebrandt, Andreas Maier, Patrick Krauss, Achim Schilling
- Abstract summary: Refusal behavior in large language models (LLMs) enables them to decline responding to harmful, unethical, or inappropriate prompts. This paper investigates refusal behavior across six LLMs from three architectural families.
- Score: 2.979183050755201
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Refusal behavior in large language models (LLMs) enables them to decline responding to harmful, unethical, or inappropriate prompts, ensuring alignment with ethical standards. This paper investigates refusal behavior across six LLMs from three architectural families. We challenge the assumption of refusal as a linear phenomenon by employing dimensionality reduction techniques, including PCA, t-SNE, and UMAP. Our results reveal that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. These findings highlight the need for nonlinear interpretability to improve alignment research and inform safer AI deployment strategies.
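In spirit, the analysis can be reproduced by collecting hidden-state activations for harmful and harmless prompts at each layer and comparing a linear embedding (PCA) against nonlinear ones (t-SNE, UMAP). The sketch below uses random placeholder activations and labels rather than any model from the paper, and assumes scikit-learn and umap-learn are installed.

```python
# Hedged sketch: compare linear (PCA) vs. nonlinear (t-SNE, UMAP) 2-D embeddings
# of per-layer activations for harmful vs. harmless prompts. The activations and
# labels below are random placeholders, not outputs of any model in the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # from the umap-learn package

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 512))        # placeholder: (n_prompts, hidden_dim) activations at one layer
labels = rng.integers(0, 2, size=200)     # placeholder: 1 = harmful prompt, 0 = harmless prompt

embeddings = {
    "pca": PCA(n_components=2).fit_transform(acts),
    "tsne": TSNE(n_components=2, perplexity=30).fit_transform(acts),
    "umap": umap.UMAP(n_components=2).fit_transform(acts),
}

# Crude separability check: distance between class centroids in each embedding.
for name, emb in embeddings.items():
    gap = np.linalg.norm(emb[labels == 1].mean(axis=0) - emb[labels == 0].mean(axis=0))
    print(name, round(float(gap), 3))
```

If refusal were a strictly linear phenomenon, the linear embedding would be expected to separate the two prompt classes roughly as well as the nonlinear ones; the paper argues the structure is instead nonlinear, multidimensional, and layer-dependent.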
Related papers
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment.
We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request.
We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions.
We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
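For context, the "single refusal direction" assumption from prior work is usually operationalized as a difference of mean activations between harmful and harmless prompts, which can then be ablated from the hidden states. The snippet below sketches only that classical construction; it is not the gradient-based method this paper proposes.

```python
# Illustrative sketch (not this paper's gradient-based method): estimate a
# single refusal direction as a difference of means, then project it out.
import numpy as np

def refusal_direction(acts_harmful: np.ndarray, acts_harmless: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between prompt classes, normalized to unit length."""
    d = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the refusal-direction component from each activation vector (row)."""
    return acts - np.outer(acts @ direction, direction)
```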
arXiv Detail & Related papers (2025-02-24T18:52:59Z) - Generalized Factor Neural Network Model for High-dimensional Regression [50.554377879576066]
We tackle the challenges of modeling high-dimensional data sets with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships.
Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression.
arXiv Detail & Related papers (2025-02-16T23:13:55Z) - Visual Error Patterns in Multi-Modal AI: A Statistical Approach [0.0]
Multi-modal large language models (MLLMs) excel at integrating text and visual data but face systematic challenges when interpreting ambiguous or incomplete visual stimuli. This study leverages statistical modeling to analyze the factors driving these errors, using a dataset of geometric stimuli characterized by features such as 3D structure, rotation, and missing face/side.
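As a rough illustration of that statistical-modeling setup, one could regress a binary error indicator on the stimulus features; the feature names and data below are synthetic placeholders, not the study's dataset.

```python
# Hedged sketch of the statistical-modeling setup: regress a binary error
# indicator on stimulus features. Feature names and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
features = rng.integers(0, 2, size=(500, 3))   # placeholder flags: [is_3d, rotated, missing_face]
errors = rng.integers(0, 2, size=500)          # placeholder: 1 = MLLM misinterpreted the stimulus

model = LogisticRegression().fit(features, errors)
print(dict(zip(["is_3d", "rotated", "missing_face"], model.coef_[0].round(3))))
```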
arXiv Detail & Related papers (2024-11-27T01:20:08Z) - Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions.
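A minimal sketch of an affine steering function, assuming it takes the general form h -> W h + b applied to hidden states; the particular choice of W as a projector below is only a toy example, not the paper's construction.

```python
# Toy sketch of an affine steering function h -> W h + b (assumption: this is
# the general form studied; the paper's optimal W and b are not reproduced here).
import numpy as np

def affine_steer(H: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply the affine map h -> W h + b to each row (hidden state) of H."""
    return H @ W.T + b

# Example: W projects out a single undesirable "concept" direction v; b is left
# at zero in this toy case.
rng = np.random.default_rng(2)
d = 8
v = rng.normal(size=d)
v /= np.linalg.norm(v)
W = np.eye(d) - np.outer(v, v)          # removes the component along v
b = np.zeros(d)
H = rng.normal(size=(4, d))             # four placeholder hidden states
assert np.allclose(affine_steer(H, W, b) @ v, 0.0)   # the v-component is gone
```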
arXiv Detail & Related papers (2024-02-15T00:20:30Z) - Layered Models can "Automatically" Regularize and Discover Low-Dimensional Structures via Feature Learning [6.109362130047454]
We study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output.
We show that the two-layer model can "automatically" induce regularization and facilitate feature learning.
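The described model class can be written as f(x) = g(Ax): a linear map A reduces the input to a few latent features, and a nonparametric function g maps those features to the response. A toy instance, with illustrative names only:

```python
# Toy instance of the two-layer model f(x) = g(A x); A and g are illustrative
# placeholders, not the paper's notation or learned components.
import numpy as np

def two_layer_predict(x: np.ndarray, A: np.ndarray, g) -> float:
    """Linear dimension reduction (A x) followed by a nonlinear map g."""
    return g(A @ x)

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 100)) / 10.0     # projects a 100-dim input to 3 latent features
g = lambda z: float(np.sin(z).sum())     # placeholder smooth nonlinear function
x = rng.normal(size=100)
print(two_layer_predict(x, A, g))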
arXiv Detail & Related papers (2023-10-18T06:15:35Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake
Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by using human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - Automatically Correcting Large Language Models: Surveying the landscape
of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks, yet their output can still be flawed.
A promising approach to rectifying these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - Learning Nonlinear Projections for Reduced-Order Modeling of Dynamical
Systems using Constrained Autoencoders [0.0]
We introduce a class of nonlinear projections described by constrained autoencoder neural networks in which both the manifold and the projection fibers are learned from data.
Our architecture uses invertible activation functions and biorthogonal weight matrices to ensure that the encoder is a left inverse of the decoder.
We also introduce new dynamics-aware cost functions that promote learning of oblique projection fibers that account for fast dynamics and nonnormality.
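The left-inverse and oblique-projection ideas are easiest to see in the linear special case: a decoder basis Phi and an encoder basis Psi satisfying the biorthogonality condition Psi^T Phi = I yield an idempotent projection P = Phi Psi^T. The sketch below shows only this linear analogue, not the paper's nonlinear autoencoder architecture.

```python
# Linear analogue of the biorthogonal encoder/decoder construction; the paper's
# method uses nonlinear autoencoders with invertible activations instead.
import numpy as np

rng = np.random.default_rng(6)
n, r = 10, 3
Phi = rng.normal(size=(n, r))              # decoder basis
Psi = rng.normal(size=(n, r))
Psi = Psi @ np.linalg.inv(Phi.T @ Psi)     # enforce biorthogonality: Psi^T Phi = I

encode = lambda x: Psi.T @ x               # z = Psi^T x
decode = lambda z: Phi @ z                 # x_hat = Phi z
P = Phi @ Psi.T                            # oblique projection onto the decoded subspace

x = rng.normal(size=n)
assert np.allclose(encode(decode(encode(x))), encode(x))  # encoder is a left inverse of decoder
assert np.allclose(P @ P, P)                               # P is idempotent (a projection)
```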
arXiv Detail & Related papers (2023-07-28T04:01:48Z) - XAI-TRIS: Non-linear image benchmarks to quantify false positive
post-hoc attribution of feature importance [1.3958169829527285]
A lack of formal underpinning leaves it unclear what conclusions can safely be drawn from the results of a given XAI method.
This means that challenging non-linear problems, typically solved by deep neural networks, presently lack appropriate remedies.
We show that popular XAI methods are often unable to significantly outperform random performance baselines and edge detection methods.
arXiv Detail & Related papers (2023-06-22T11:31:11Z) - Learning Reduced Nonlinear State-Space Models: an Output-Error Based
Canonical Approach [8.029702645528412]
We investigate the effectiveness of deep learning in the modeling of dynamic systems with nonlinear behavior.
We show its ability to identify three different nonlinear systems.
Performance is evaluated in terms of open-loop prediction on test data generated in simulation, as well as on a real-world dataset of unmanned aerial vehicle flight measurements.
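For reference, a discrete-time nonlinear state-space model of the kind identified here has the form x_{k+1} = f(x_k, u_k), y_k = h(x_k), and open-loop prediction simply rolls the state forward from an initial condition using the recorded inputs. The functions below are arbitrary placeholders, not the paper's learned networks.

```python
# Generic sketch of open-loop prediction with a discrete-time nonlinear
# state-space model x_{k+1} = f(x_k, u_k), y_k = h(x_k); f and h are placeholders.
import numpy as np

def simulate(f, h, x0, u_seq):
    """Roll the state forward from x0 using the input sequence u_seq, recording outputs."""
    x, ys = x0, []
    for u in u_seq:
        ys.append(h(x))
        x = f(x, u)
    return np.array(ys)

f = lambda x, u: np.tanh(0.9 * x + 0.1 * u)   # placeholder state transition
h = lambda x: float(x.sum())                  # placeholder output map
print(simulate(f, h, np.zeros(2), np.ones((5, 2))))
```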
arXiv Detail & Related papers (2022-04-19T06:33:23Z) - Nonlinear ISA with Auxiliary Variables for Learning Speech
Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.