DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- URL: http://arxiv.org/abs/2602.05859v1
- Date: Thu, 05 Feb 2026 16:41:25 GMT
- Title: DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- Authors: Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
- Abstract summary: We present DLM-Scope, the first SAE-based interpretability framework for diffusion language models. We show that trained Top-K SAEs can faithfully extract interpretable features, and we demonstrate the strong potential of applying SAEs to DLM-related tasks and algorithms.
- Score: 73.18745837755758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. As diffusion language models (DLMs) become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. We also open new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order, and that SAE features remain stable through the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and demonstrates the potential of applying SAEs to DLM-related tasks and algorithms.
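The abstract does not include code, but the Top-K SAE architecture it refers to is standard enough to sketch. The following minimal PyTorch implementation is illustrative only (all dimensions and names are assumptions, not the authors' code): the encoder keeps just the k largest pre-activations per position, and the decoder reconstructs the hooked activations.

```python
# Minimal Top-K sparse autoencoder sketch (illustrative; not DLM-Scope's code).
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # h: residual-stream activations, shape (batch, seq, d_model)
        pre = self.encoder(h)
        # Keep only the top-k pre-activations per position; zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        recon = self.decoder(codes)
        return recon, codes


# Training objective is plain reconstruction of the hooked activations.
sae = TopKSAE(d_model=2048, n_features=32768, k=64)  # assumed sizes
h = torch.randn(2, 16, 2048)  # stand-in for activations at a chosen DLM layer
recon, codes = sae(h)
loss = torch.nn.functional.mse_loss(recon, h)
```

In an interpretability setting, `h` would be activations captured at a chosen DLM layer via a forward hook, and the learned `codes` are the sparse features used for analysis and steering.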
Related papers
- Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models [82.87985794856803]
Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.
arXiv Detail & Related papers (2025-10-05T10:50:52Z)
- Inverse Language Modeling towards Robust and Grounded LLMs [3.3072037841206345]
We propose Inverse Language Modeling (ILM), a unified framework that improves the robustness of LLMs to input perturbations. ILM transforms LLMs from static generators into analyzable and robust systems, and can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable.
arXiv Detail & Related papers (2025-10-02T11:47:18Z)
- LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs [63.580867975515474]
We present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional autoregressive LLMs, and propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation.
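The abstract leaves the extrapolation mechanism implicit. As a rough illustration of the general NTK-aware RoPE trick it builds on (the exact LongLLaDA variant may differ), here is a hedged Python sketch; all parameter values are assumptions.

```python
# Hedged sketch of NTK-aware RoPE extrapolation (the general technique,
# not necessarily the exact LongLLaDA formulation).
import torch


def rope_frequencies(dim: int, base: float = 10000.0,
                     scale: float = 1.0) -> torch.Tensor:
    # NTK-aware trick: enlarge the RoPE base so low-frequency bands are
    # stretched to cover a `scale`-times longer context, training-free.
    ntk_base = base * scale ** (dim / (dim - 2))
    return 1.0 / (ntk_base ** (torch.arange(0, dim, 2).float() / dim))


inv_freq = rope_frequencies(dim=128, scale=4.0)  # e.g. 4x context extension
positions = torch.arange(8192).float()
angles = torch.outer(positions, inv_freq)  # feeds the cos/sin rotation tables
```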
arXiv Detail & Related papers (2025-06-17T11:45:37Z)
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment [7.374787098456952]
We introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to multimodal large language models. SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Our results highlight SAE-V's ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
arXiv Detail & Related papers (2025-02-22T14:20:07Z)
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
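A per-feature penalty of this kind reduces to a standard Lasso after rescaling columns. The sketch below shows the idea with scikit-learn; the penalty factors are hard-coded stand-ins for values an LLM would supply, and nothing here reproduces the paper's pipeline.

```python
# Weighted-Lasso sketch: per-feature penalties via column rescaling.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

# Stand-in for LLM-derived penalty factors: lower = "more relevant".
penalty = np.array([0.2, 0.2, 2.0, 2.0])

# Fitting on X / penalty and dividing the coefficients back is equivalent
# to penalizing |beta_j| by penalty_j in the Lasso objective.
model = Lasso(alpha=0.1).fit(X / penalty, y)
coef = model.coef_ / penalty
print(coef)  # relevant features (0, 1) survive; the others are shrunk harder
```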
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capabilities in many natural language tasks, but they are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the LLM decoding process with deliberative planning.
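The blurb only names the idea; one common way to realize deliberative decoding is best-first search over partial reasoning traces under a learned value function. The sketch below is a generic, hedged illustration of that pattern (`expand`, `q_value`, and the `[ANSWER]` marker are hypothetical stand-ins, not the paper's components).

```python
# Hedged sketch of deliberative decoding as best-first search.
import heapq


def expand(state: str) -> list[str]:
    # Stand-in: in practice, sample candidate next reasoning steps from the LLM.
    return [state + " step", state + " [ANSWER]"]


def q_value(state: str) -> float:
    # Stand-in heuristic: prefer finished traces, and shorter ones first.
    return (1.0 if state.endswith("[ANSWER]") else 0.0) - 0.01 * len(state)


def deliberative_decode(question: str, max_expansions: int = 100) -> str:
    # Best-first search over partial reasoning traces, highest value first
    # (negated scores turn Python's min-heap into a max-heap).
    frontier = [(-q_value(question), question)]
    state = question
    for _ in range(max_expansions):
        _, state = heapq.heappop(frontier)
        if state.endswith("[ANSWER]"):
            return state
        for nxt in expand(state):
            heapq.heappush(frontier, (-q_value(nxt), nxt))
    return state  # best trace found within the budget


print(deliberative_decode("Q: 2+2?"))
```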
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization [12.418844515095035]
Large Language Models (LLMs) tend to produce inaccurate responses to specific queries, and incorrect tokenization is a critical point that hinders LLMs from understanding the input precisely. We construct an adversarial dataset, named ADT (Adversarial Dataset for Tokenizer), which draws on the vocabularies of various open-source LLMs to challenge their tokenization.
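To make the failure mode concrete, the snippet below shows how an off-the-shelf subword tokenizer can split an input into fragments that obscure word boundaries. This is a generic illustration with the Hugging Face GPT-2 tokenizer, not the ADT construction itself, and the printed splits are indicative rather than guaranteed.

```python
# Generic illustration of tokenization fragility (not the ADT pipeline).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Rare or adversarially chosen strings get split into subwords that do not
# align with morpheme boundaries, which can degrade understanding; ADT is
# built to systematically trigger such splits.
print(tok.tokenize("indivisibilities"))   # e.g. ['ind', 'iv', 'isibilities']
print(tok.tokenize(" indivisibilities"))  # a leading space changes the split
```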
arXiv Detail & Related papers (2024-05-27T11:39:59Z)
- Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be explained to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
- Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling [0.0]
Recent decoder-only large language models (LLMs) perform on par with smaller state-of-the-art encoders.
We explore techniques for improving the sequence labeling (SL) performance of open LLMs on information extraction (IE) tasks by applying layer-wise removal of the causal mask (CM).
Our findings hold across diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong encoders and even instruction-tuned LLMs.
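A minimal sketch of the layer-wise unmasking idea, assuming a standard decoder stack (illustrative; the paper's exact layer schedule and integration are not reproduced): lower layers keep the causal mask while upper layers attend bidirectionally.

```python
# Layer-wise causal-mask removal sketch (illustrative only).
import torch


def build_masks(seq_len: int, n_layers: int, unmask_from: int) -> list[torch.Tensor]:
    # True = attention allowed. Lower layers keep the causal (lower-triangular)
    # mask; layers at or above `unmask_from` see the full bidirectional context.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return [causal if i < unmask_from else full for i in range(n_layers)]


masks = build_masks(seq_len=8, n_layers=12, unmask_from=6)
# Each mask would be passed to its layer's attention to gate which key
# positions every query position may attend to.
```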
arXiv Detail & Related papers (2024-01-25T22:50:48Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
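The self-play mechanism can be sketched as a pairwise objective in which the human-annotated response is preferred over the previous iterate's own generation. The snippet below is a simplified, hedged rendering of that idea (the paper's exact loss, notation, and training schedule are not reproduced; all tensors are toy stand-ins).

```python
# Simplified SPIN-style pairwise objective (illustrative only).
import torch
import torch.nn.functional as F


def spin_loss(logp_human, logp_self, ref_logp_human, ref_logp_self,
              beta: float = 0.1) -> torch.Tensor:
    # Pairwise objective: the human response is the "winner" and the
    # previous iterate's own generation is the "loser"; `ref_*` are
    # sequence log-probs under the frozen previous-iteration model.
    margin = beta * ((logp_human - ref_logp_human)
                     - (logp_self - ref_logp_self))
    return -F.logsigmoid(margin).mean()


# Toy sequence log-probs under the current and previous models.
loss = spin_loss(torch.tensor([-12.0]), torch.tensor([-10.0]),
                 torch.tensor([-11.0]), torch.tensor([-10.5]))
print(loss)
```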
arXiv Detail & Related papers (2024-01-02T18:53:13Z)