Route Sparse Autoencoder to Interpret Large Language Models
- URL: http://arxiv.org/abs/2503.08200v2
- Date: Thu, 10 Apr 2025 02:58:01 GMT
- Title: Route Sparse Autoencoder to Interpret Large Language Models
- Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He,
- Abstract summary: Route Sparse Autoencoder (RouteSAE) is a framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers.<n>Under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score.
- Score: 33.44362399988847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
Related papers
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - Sparse Autoencoder Features for Classifications and Transferability [11.2185030332009]
We analyze Sparse Autoencoders (SAEs) for interpretable feature extraction from Large Language Models (LLMs)
Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations.
arXiv Detail & Related papers (2025-02-17T02:30:45Z) - Sparse Autoencoders Trained on the Same Data Learn Different Features [0.7234862895932991]
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in large language models.<n>Our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features.
arXiv Detail & Related papers (2025-01-28T01:24:16Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions.<n>Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.<n>We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.<n>Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection [14.137208879448389]
LiDAR-based sparse 3D object detection plays a crucial role in autonomous driving applications.
Existing methods either use the features of a single central voxel as an object proxy, or treat an aggregated cluster of foreground points as an object proxy.
We propose SparseDet which designs sparse queries as object proxies.
arXiv Detail & Related papers (2024-06-16T12:11:44Z) - Salient Object Detection in Optical Remote Sensing Images Driven by
Transformer [69.22039680783124]
We propose a novel Global Extraction Local Exploration Network (GeleNet) for Optical Remote Sensing Images (ORSI-SOD)
Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies.
Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods.
arXiv Detail & Related papers (2023-09-15T07:14:43Z) - Learning to Generate Content-Aware Dynamic Detectors [62.74209921174237]
We introduce a newpective of designing efficient detectors, which is automatically generating sample-adaptive model architecture.
We introduce a course-to-fine strat-egy tailored for object detection to guide the learning of dynamic routing.
Experiments on MS-COCO dataset demonstrate that CADDet achieves 1.8 higher mAP with 10% fewer FLOPs compared with vanilla routing.
arXiv Detail & Related papers (2020-12-08T08:05:20Z) - LiDAR-based Panoptic Segmentation via Dynamic Shifting Network [56.71765153629892]
LiDAR-based panoptic segmentation aims to parse both objects and scenes in a unified manner.
We propose the Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm.
Our proposed DS-Net achieves superior accuracies over current state-of-the-art methods.
arXiv Detail & Related papers (2020-11-24T08:44:46Z) - Lightweight Single-Image Super-Resolution Network with Attentive
Auxiliary Feature Learning [73.75457731689858]
We develop a computation efficient yet accurate network based on the proposed attentive auxiliary features (A$2$F) for SISR.
Experimental results on large-scale dataset demonstrate the effectiveness of the proposed model against the state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.