Mechanistic Interpretability of Antibody Language Models Using SAEs
- URL: http://arxiv.org/abs/2512.05794v1
- Date: Fri, 05 Dec 2025 15:18:50 GMT
- Title: Mechanistic Interpretability of Antibody Language Models Using SAEs
- Authors: Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Charlotte M. Deane
- Abstract summary: We employ sparse autoencoders (SAEs) to provide insight into learned concepts within large protein language models. TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns.
- Score: 1.7218681244575125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
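As a rough illustration of the TopK mechanism described in the abstract (a minimal sketch, not the paper's code; all weights and dimensions here are assumed placeholders), a TopK SAE keeps only the k largest latent activations per input and zeroes the rest before decoding:

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sketch of a TopK SAE forward pass: encode with a ReLU, keep only
    the k largest activations, zero the rest, then decode back to the
    model's residual space."""
    acts = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    if k < acts.size:
        drop = np.argpartition(acts, -k)[:-k]   # indices outside the top k
        acts[drop] = 0.0
    recon = acts @ W_dec + b_dec
    return acts, recon

# toy example with random weights (illustrative only)
rng = np.random.default_rng(0)
d_model, d_latent, k = 16, 64, 8
x = rng.normal(size=d_model)
W_enc = rng.normal(size=(d_model, d_latent))
W_dec = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
acts, recon = topk_sae(x, W_enc, np.zeros(d_latent), W_dec, np.zeros(d_model), k)
```

Enforcing sparsity architecturally (rather than via an L1 penalty) is what makes the active latents easy to map to candidate concepts.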
Related papers
- When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability [0.0]
Recent work by Anthropic on mechanistic interpretability claims to understand and control Large Language Models. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1. We find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context.
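The steering being stress-tested above typically amounts to adding a scaled SAE decoder direction to a hidden state at a chosen layer; a minimal sketch under assumed names (the layer choice and `alpha` are exactly the knobs whose sensitivity the paper reports):

```python
import numpy as np

def steer_hidden(h, decoder_dir, alpha):
    """Add a scaled, unit-normalized feature direction to hidden states.
    h: (seq_len, d_model) hidden states at the chosen layer.
    decoder_dir: (d_model,) SAE decoder row for the feature being steered.
    alpha: steering magnitude; negative values suppress the feature."""
    unit = decoder_dir / np.linalg.norm(decoder_dir)
    return h + alpha * unit

# toy demonstration with random states (illustrative only)
rng = np.random.default_rng(1)
h = rng.normal(size=(5, 16))
d = rng.normal(size=16)
steered = steer_hidden(h, d, alpha=4.0)
# the projection onto the feature direction grows by exactly alpha
unit = d / np.linalg.norm(d)
proj_before = h @ unit
proj_after = steered @ unit
```

The intervention is linear and exact in the feature direction, yet its downstream effect on generation depends on where and how strongly it is applied, which is the fragility observed above.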
arXiv Detail & Related papers (2026-01-06T14:29:51Z)
- Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models [82.79223371188756]
Chain-of-Thought (CoT) prompting has advanced task-solving capabilities in natural language processing with large language models. Applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible. We introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning.
arXiv Detail & Related papers (2025-12-24T05:25:17Z)
- Re-envisioning Euclid Galaxy Morphology: Identifying and Interpreting Features with Sparse Autoencoders [0.14323566945483496]
Sparse Autoencoders (SAEs) can efficiently identify candidate monosemantic features from pretrained neural networks for galaxy morphology. We demonstrate this on Euclid Q1 images using both supervised (Zoobot) and new self-supervised (MAE) models.
arXiv Detail & Related papers (2025-10-27T18:28:56Z)
- ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders [30.219733023958188]
The Sparse Autoencoder (SAE) has emerged as a powerful tool for the mechanistic interpretability of large language models. We propose a semantically-guided SAE, called ProtSAE. We show that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods.
arXiv Detail & Related papers (2025-08-26T11:20:31Z)
- TopK Language Models [23.574227495324568]
TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts.
arXiv Detail & Related papers (2025-06-26T16:56:43Z)
- Dense SAE Latents Are Features, Not Bugs [86.50389855919292]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z)
- Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
- Towards Interpretable Protein Structure Prediction with Sparse Autoencoders [0.0]
Matryoshka SAEs learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time. We show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction.
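The nested-group reconstruction described above can be sketched as a sum of losses over growing prefixes of the latent vector (a simplified illustration; the group sizes and all names are assumptions, not the paper's implementation):

```python
import numpy as np

def matryoshka_loss(x, acts, W_dec, group_sizes):
    """Sum of reconstruction losses in which each nested prefix of
    latents must reconstruct x on its own, pushing early latents to
    carry coarse, broadly useful features and later ones finer detail."""
    loss = 0.0
    for g in np.cumsum(group_sizes):
        recon = acts[:g] @ W_dec[:g]            # decode using first g latents
        loss += float(np.sum((x - recon) ** 2))  # squared reconstruction error
    return loss

# toy example with random latents (illustrative only)
rng = np.random.default_rng(2)
d_model, d_latent = 8, 32
x = rng.normal(size=d_model)
acts = np.maximum(rng.normal(size=d_latent), 0.0)
W_dec = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
loss = matryoshka_loss(x, acts, W_dec, group_sizes=[4, 4, 8, 16])
```

Because every prefix is penalized independently, the hierarchy is built into the objective rather than recovered after training.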
arXiv Detail & Related papers (2025-03-11T17:57:29Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Can sparse autoencoders make sense of gene expression latent variable models? [0.0]
This work explores the potential of SAEs for decomposing embeddings in complex and high-dimensional biological data. The application to embeddings from pretrained single-cell models shows that SAEs can find and steer key biological processes. scFeatureLens is an automated interpretability approach for linking SAE features and biological concepts from gene sets.
arXiv Detail & Related papers (2024-10-15T10:16:01Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms is a significant challenge. We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences. We demonstrate that generative models like GPT can accurately learn and reason over CFG-defined hierarchies and generate sentences based on them.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.