A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
- URL: http://arxiv.org/abs/2601.18939v1
- Date: Mon, 26 Jan 2026 20:20:13 GMT
- Title: A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
- Authors: Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O'Brien, Ashwinee Panda, Ryan Lagasse
- Abstract summary: Behavioral alignment in large language models is often achieved through broad fine-tuning. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning.
- Score: 7.405817106579332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even when little data is available.
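The gradient-masking step is easy to illustrate. Below is a minimal PyTorch sketch under assumed shapes: a toy MLP block stands in for a Gemma-2 MLP layer, `target_neurons` stands in for the 3% of neurons selected via SAEs and probes, and the loss is a placeholder. None of this is the authors' actual code.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer MLP block; in the paper the targets are
# MLP neurons of Gemma-2-2B/9B, selected with SAEs and linear probes.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

# Hypothetical indices of the ~3% of hidden neurons most predictive of sycophancy.
target_neurons = torch.tensor([3, 17, 42, 99, 200])

def mask_grads(block, keep_idx):
    """Zero every gradient except the rows/columns owned by the kept neurons."""
    up, down = block[0], block[2]
    keep = torch.zeros(up.out_features, dtype=torch.bool)
    keep[keep_idx] = True
    up.weight.grad[~keep, :] = 0.0    # input-projection rows of masked neurons
    up.bias.grad[~keep] = 0.0
    down.weight.grad[:, ~keep] = 0.0  # output-projection columns of masked neurons
    down.bias.grad.zero_()            # residual-dim bias is not neuron-specific; freeze it

opt = torch.optim.AdamW(mlp.parameters(), lr=1e-4)
x, y = torch.randn(8, 64), torch.randn(8, 64)  # placeholder anti-sycophancy batch
nn.functional.mse_loss(mlp(x), y).backward()
mask_grads(mlp, target_neurons)  # restrict the update to the selected neurons
opt.step()
```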
Related papers
- Why Machine Learning Models Systematically Underestimate Extreme Values II: How to Fix It with LatentNN [0.2700171473617699]
Attenuation bias affects astronomical data-driven models. We show that neural networks suffer from the same attenuation bias. We introduce LatentNN, a method that jointly optimizes network parameters and latent input values.
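A minimal sketch of the joint-optimization idea, assuming a regression setup with noisy inputs; the architecture and the tether term are illustrative guesses, not LatentNN's actual design:

```python
import torch
import torch.nn as nn

# Attenuation bias: noise in the inputs biases fitted slopes toward zero.
x_true = torch.randn(200, 1)
x_obs = x_true + 0.5 * torch.randn(200, 1)   # noisy observed inputs
y = 2.0 * x_true + 0.1 * torch.randn(200, 1)

net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
# Latent "denoised" inputs, initialized at the observations, optimized jointly.
x_latent = x_obs.clone().requires_grad_(True)

opt = torch.optim.Adam(list(net.parameters()) + [x_latent], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    fit = nn.functional.mse_loss(net(x_latent), y)
    # Tether the latents to the observations, weighted by an assumed noise scale.
    prior = nn.functional.mse_loss(x_latent, x_obs) / (2 * 0.5**2)
    (fit + prior).backward()
    opt.step()
```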
arXiv Detail & Related papers (2025-12-29T01:59:10Z)
- Revisiting Large Language Model Pruning using Neuron Semantic Attribution [63.62836612864512]
We conduct evaluations on 24 datasets and 4 tasks using popular pruning methods. Surprisingly, we find a significant performance drop of existing pruning methods in sentiment classification tasks. We propose Neuron Semantic Attribution, which learns to associate each neuron with specific semantics.
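One way to picture "associating each neuron with specific semantics" is a simple activation-contrast score over category-labeled data. This toy version is our illustration, not the paper's learned attribution:

```python
import torch

# Hypothetical setup: activations of H neurons on N examples, each labeled
# with one of K semantic categories (e.g., sentiment, syntax, topic, style).
N, H, K = 1000, 512, 4
acts = torch.randn(N, H)
labels = torch.randint(0, K, (N,))

# Score each (category, neuron) pair by the mean activation within the
# category relative to the global mean; the top category "names" the neuron.
global_mean = acts.mean(dim=0)                                               # (H,)
cat_means = torch.stack([acts[labels == k].mean(dim=0) for k in range(K)])  # (K, H)
attribution = cat_means - global_mean
neuron_semantics = attribution.argmax(dim=0)  # best-matching category per neuron
```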
arXiv Detail & Related papers (2025-03-03T13:52:17Z)
- Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment [126.34547428473968]
Large language models (LLMs) still struggle to align with human preferences in complex tasks and scenarios.
We propose a low-redundant alignment method named ALLO, focusing on optimizing the most related neurons with the most useful supervised signals.
Experimental results on 10 datasets have shown the effectiveness of ALLO.
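The selection side of such low-redundancy methods can be sketched as ranking neurons by gradient attribution on alignment data and keeping a small budget. The 10% budget and the loss below are assumptions, not ALLO's exact procedure:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model and batch; this only shows the generic
# "rank neurons, keep a few" pattern shared by these methods.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 32))
x, y = torch.randn(16, 32), torch.randn(16, 32)

nn.functional.mse_loss(model(x), y).backward()
score = model[0].weight.grad.abs().sum(dim=1)      # attribution per hidden neuron
topk = score.topk(int(0.10 * len(score))).indices  # assumed 10% neuron budget

# Subsequent training steps would mask gradients to `topk` only, much like the
# gradient-masking sketch shown above for the main paper.
```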
arXiv Detail & Related papers (2024-06-18T13:34:40Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
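The flavor of the objective is easiest to see in the fully independent special case: pick a logical constraint, compute its probability under the factorized output distribution, and minimize the negative log. The paper's pseudo-semantic loss refines this for autoregressive models; the "exactly one" constraint below is just an illustration:

```python
import torch

# Illustrative constraint: "exactly one of T binary outputs is 1".
# Under an independent factorization with probabilities p, P(constraint)
# has a closed form, and -log P(constraint) is a differentiable loss.
logits = torch.randn(8, 5, requires_grad=True)  # batch of 8, T = 5
p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)

none_on = torch.prod(1 - p, dim=1)                            # P(all zero)
exactly_one = (p / (1 - p) * none_on.unsqueeze(1)).sum(dim=1) # sum_i p_i * prod_{j!=i}(1-p_j)
semantic_loss = -torch.log(exactly_one).mean()
semantic_loss.backward()  # gradients flow back to the logits
```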
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- DANAA: Towards transferable attacks with double adversarial neuron attribution [37.33924432015966]
We propose a double adversarial neuron attribution attack method, termed DANAA, to obtain more accurate feature importance estimation.
The goal is to measure the weight of individual neurons and retain the features that are more important for transferability.
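A hedged sketch of one feature-level, attribution-weighted attack step; the grad-times-activation proxy here is a stand-in, not DANAA's double-adversarial attribution:

```python
import torch
import torchvision.models as models

# Perturb the input so that mid-layer features deemed important by a
# gradient-based attribution are disrupted; disrupting shared features is
# what makes such attacks transfer across models.
model = models.resnet18(weights=None).eval()
feats = {}
model.layer2.register_forward_hook(lambda m, i, o: feats.update(out=o))

x = torch.randn(1, 3, 224, 224, requires_grad=True)
logit = model(x)[0].max()
grad = torch.autograd.grad(logit, feats["out"], retain_graph=True)[0]
attribution = (grad * feats["out"]).detach()  # importance proxy per feature unit

# One attack step: push the important features down.
attack_loss = (attribution * feats["out"]).sum()
attack_loss.backward()
x_adv = (x - 0.01 * x.grad.sign()).clamp(-3, 3).detach()
```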
arXiv Detail & Related papers (2023-10-16T14:11:32Z)
- Magnificent Minified Models [0.360953887026184]
This paper concerns the task of taking a large trained neural network and 'compressing' it by deleting parameters or entire neurons.
We compare various methods of parameter and neuron selection: dropout-based neuron damage estimation, neuron merging, absolute-value based selection, and random selection.
For neuron-level pruning, retraining from scratch did much better in our experiments.
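Absolute-value based neuron selection, the simplest of these baselines, fits in a few lines; the layer sizes and 50% keep ratio here are arbitrary:

```python
import torch
import torch.nn as nn

# Minimal magnitude-based neuron pruning: drop the hidden neurons whose
# incoming weights have the smallest L1 norm, then rebuild a smaller layer pair.
lin1, lin2 = nn.Linear(64, 256), nn.Linear(256, 10)
keep_frac = 0.5

norms = lin1.weight.abs().sum(dim=1)  # L1 norm per hidden neuron
keep = norms.topk(int(keep_frac * lin1.out_features)).indices.sort().values

small1 = nn.Linear(64, len(keep))
small2 = nn.Linear(len(keep), 10)
with torch.no_grad():
    small1.weight.copy_(lin1.weight[keep]); small1.bias.copy_(lin1.bias[keep])
    small2.weight.copy_(lin2.weight[:, keep]); small2.bias.copy_(lin2.bias)
```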
arXiv Detail & Related papers (2023-06-16T21:00:44Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
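A rough sketch of the comparative idea using dropout as the ablation: penalize whenever a randomly ablated copy of the model beats the full model, so idle neurons get pressured into usefulness. This is our reading of the mechanism, not the paper's exact loss:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 3))
ce = nn.CrossEntropyLoss()
x, y = torch.randn(16, 32), torch.randint(0, 3, (16,))

model.eval()                  # dropout off: the full model
loss_full = ce(model(x), y)
model.train()                 # dropout on: a randomly ablated sub-model
loss_ablated = ce(model(x), y)

# Task loss plus a hinge that fires when the ablated model beats the full one.
comparative = torch.relu(loss_full - loss_ablated)
(loss_full + comparative).backward()
```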
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- Improving Adversarial Transferability via Neuron Attribution-Based Attacks [35.02147088207232]
We propose the Neuron Attribution-based Attack (NAA), which conducts feature-level attacks with more accurate neuron importance estimations.
We derive an approximation scheme for neuron attribution that greatly reduces the computational overhead.
Experiments confirm the superiority of our approach over state-of-the-art baselines.
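Neuron attribution is typically an integrated-gradients-style path integral from a baseline input; NAA's contribution is a cheap approximation of it. The sketch below shows the expensive multi-step estimate being approximated, with the layer choice and step count assumed:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
feats = {}
model.layer3.register_forward_hook(lambda m, i, o: feats.update(out=o))

x = torch.randn(1, 3, 224, 224)
baseline = torch.zeros_like(x)  # assumed black-image baseline
steps = 8
grad_sum = 0.0
for alpha in torch.linspace(0.0, 1.0, steps):
    xi = baseline + alpha * (x - baseline)       # point on the path
    logit = model(xi)[0].max()
    grad_sum = grad_sum + torch.autograd.grad(logit, feats["out"])[0]

with torch.no_grad():
    model(x)  # cache the real activations via the hook
# Per-neuron attribution: activation times path-averaged gradient.
attribution = feats["out"] * (grad_sum / steps)
```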
arXiv Detail & Related papers (2022-03-31T13:47:30Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models to infer from inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform traditional two-step approaches that first impute with state-of-the-art methods and then predict.
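A very rough sketch of the idea: a small policy looks at each sample's missingness mask and scales that sample's gradient contribution, nudged REINFORCE-style by the resulting errors. The real IGSGD design and its RL formulation are richer than this:

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 1)
policy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(32, 8)
mask = (torch.rand(32, 8) > 0.3).float()  # 1 = observed, 0 = missing
x_in = x * mask                            # missing entries zeroed, no imputation
y = x.sum(dim=1, keepdim=True)             # placeholder target

scale = torch.sigmoid(policy(mask)).squeeze(1)  # per-sample gradient weight
err = (net(x_in) - y).pow(2).squeeze(1)

(scale.detach() * err).mean().backward()   # model update uses scaled gradients

reward = -err.detach()                     # lower error = higher reward
(-(torch.log(scale + 1e-8) * reward).mean()).backward()  # policy update
```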
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
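As one concrete (and entirely assumed) way to operationalize the distinction for ReLU units: compare a neuron's bias to its maximum positive input drive. Needing most of that drive to fire looks AND-like; needing little looks OR-like:

```python
import torch
import torch.nn as nn

# Illustrative heuristic, not the paper's exact measure: a strongly negative
# bias means many inputs must fire together (AND-like); a mild bias means a
# single strong input suffices (OR-like).
layer = nn.Linear(128, 256)

pos_mass = layer.weight.clamp(min=0).sum(dim=1)  # maximum positive drive
and_score = (-layer.bias) / (pos_mass + 1e-8)    # fraction of drive needed to fire
is_and_like = and_score > 0.5                    # assumed threshold

print(f"AND-like neurons: {is_and_like.sum().item()} / {len(is_and_like)}")
```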
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.