AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- URL: http://arxiv.org/abs/2501.17148v2
- Date: Wed, 29 Jan 2025 18:52:56 GMT
- Title: AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
- Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
- Abstract summary: We introduce AxBench, a large-scale benchmark for steering and concept detection.
For steering, we find that prompting outperforms all existing methods, followed by finetuning.
For concept detection, representation-based methods, such as difference-in-means, perform best.
- Score: 73.37603699731329
- License:
- Abstract: Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods, such as difference-in-means, perform best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
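Of the methods compared, difference-in-means (DiffMean) is simple enough to sketch: the concept direction is the mean hidden activation over concept-positive examples minus the mean over concept-negative examples, and steering adds a scaled copy of that direction to a hidden state. The snippet below is a minimal illustration of this idea, not the paper's implementation; the function names and the scaling factor `alpha` are our own.

```python
import numpy as np

def diffmean_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means: mean activation over concept-positive
    examples minus mean activation over concept-negative examples."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift a hidden state along the concept direction by alpha."""
    return hidden + alpha * direction

# Toy 2-d activations where the concept varies along the first coordinate.
pos = np.array([[2.0, 0.1], [2.2, -0.1]])
neg = np.array([[0.0, 0.1], [0.2, -0.1]])
d = diffmean_direction(pos, neg)   # approximately [2.0, 0.0]
steered = steer(np.zeros(2), d, alpha=1.0)
```

The same vector can double as a concept detector by projecting activations onto it, which is one reason the benchmark evaluates such directions on both steering and detection.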
Related papers
- Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts [11.81523319216474]
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties.
Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept.
We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations.
arXiv Detail & Related papers (2025-02-14T08:49:41Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Understanding Social Perception, Interactions, and Safety Aspects of Sidewalk Delivery Robots Using Sentiment Analysis [0.3069335774032178]
This article presents a comprehensive sentiment analysis (SA) of comments on YouTube videos related to Sidewalk Delivery Robots (SDRs).
We manually annotated the collected YouTube comments with three sentiment labels: negative (0), positive (1), and neutral (2).
We then constructed models for text sentiment classification and tested the models' performance on both binary and ternary classification tasks.
arXiv Detail & Related papers (2024-03-09T23:28:01Z) - Towards Reliable AI Model Deployments: Multiple Input Mixup for Out-of-Distribution Detection [4.985768723667418]
We propose a novel and simple method to solve the Out-of-Distribution (OOD) detection problem.
Our method can help improve the OOD detection performance with only single epoch fine-tuning.
Our method does not require training the model from scratch and can be attached to the classifier simply.
arXiv Detail & Related papers (2023-12-24T15:31:51Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupted images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z) - Accounting for multiplicity in machine learning benchmark performance [0.0]
Using the highest-ranked performance as an estimate of state-of-the-art (SOTA) performance is biased, giving overly optimistic results.
In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided.
arXiv Detail & Related papers (2023-03-10T10:32:18Z) - Scene Text Recognition with Permuted Autoregressive Sequence Models [15.118059441365343]
Context-aware STR methods typically use internal autoregressive (AR) language models (LMs).
Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling.
It achieves context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context.
arXiv Detail & Related papers (2022-07-14T14:51:50Z) - PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition [78.67749936030219]
Prune-Adjust-Re-Prune (PARP) discovers and finetunes subnetworks for much better ASR performance.
Experiments on low-resource English and multi-lingual ASR show that sparse subnetworks exist in pre-trained speech SSL models.
arXiv Detail & Related papers (2021-06-10T17:32:25Z) - Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between an image and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.