Prototype Transformer: Towards Language Model Architectures Interpretable by Design
- URL: http://arxiv.org/abs/2602.11852v1
- Date: Thu, 12 Feb 2026 11:43:39 GMT
- Title: Prototype Transformer: Towards Language Model Architectures Interpretable by Design
- Authors: Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz
- Abstract summary: We introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes. ProtoT works by means of two-way communication between the input sequence and the prototypes. It provides the potential to interpret the model's reasoning and allows for targeted edits of its behavior.
- Score: 37.30649990861446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. "woman") during training. They provide the potential to interpret the model's reasoning and allow for targeted edits of its behavior. Furthermore, by design, the prototypes create communication channels that aggregate contextual information at different time scales, aiding interpretability. In terms of computational cost, ProtoT scales linearly with sequence length, versus the quadratic scaling of SOTA self-attention transformers. Compared to baselines, ProtoT scales well with model and data size, and performs well on text generation and downstream tasks (GLUE). ProtoT exhibits robustness to input perturbations on par with or better than some baselines, but differs from them by providing interpretable pathways showing how robustness and sensitivity arise. Coming close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.
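The two-way communication the abstract describes can be sketched in a toy form. Everything below (the function names, the gather/broadcast ordering, the attention-style weighting, causal masking omitted) is an illustrative assumption rather than the paper's actual architecture; the sketch only shows why exchanging information through k fixed prototypes costs O(n·k) per layer, linear in sequence length n, instead of the O(n²) of token-to-token self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_layer(X, P, Wq, Wk, Wv):
    """One hypothetical two-way exchange between a sequence X (n x d)
    and k learned prototype vectors P (k x d).

    Both passes are (n x k)-shaped attention maps, so the cost is
    O(n * k * d): linear in n (causal masking omitted for brevity).
    """
    d = X.shape[1]
    # Gather: prototypes attend to the sequence and absorb context.
    A = softmax((P @ Wq) @ (X @ Wk).T / np.sqrt(d))        # (k, n)
    P_ctx = P + A @ (X @ Wv)                               # (k, d)
    # Broadcast: tokens attend back to the updated prototypes.
    B = softmax((X @ Wq) @ (P_ctx @ Wk).T / np.sqrt(d))    # (n, k)
    return X + B @ (P_ctx @ Wv)                            # (n, d)

rng = np.random.default_rng(0)
n, k, d = 128, 8, 16
X, P = rng.normal(size=(n, d)), rng.normal(size=(k, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Y = prototype_layer(X, P, Wq, Wk, Wv)
print(Y.shape)  # (128, 16)
```

Because every token interacts only with the k prototypes (and k is a fixed hyperparameter), doubling the sequence length doubles the work, which is the linear-scaling property the abstract claims; the (n x k) attention maps are also what would make the prototypes' learned concepts inspectable.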
Related papers
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z) - Parameter-Efficient Transformer Embeddings [0.0]
We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs. We train standard transformers and our architecture on natural language inference tasks. Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout.
arXiv Detail & Related papers (2025-05-04T21:47:18Z) - Sparse Prototype Network for Explainable Pedestrian Behavior Prediction [60.80524827122901]
We present Sparse Prototype Network (SPN), an explainable method designed to simultaneously predict a pedestrian's future action, trajectory, and pose.
Regularized by mono-semanticity and clustering constraints, the prototypes learn consistent and human-understandable features.
arXiv Detail & Related papers (2024-10-16T03:33:40Z) - The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model [3.838217057990932]
Injectable Realignment Model (IRM) is a novel approach to language model interpretability and explainability.
Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network -- the IRM -- to induce emotion-based alignments.
Analysis of the trained IRM's outputs reveals a curious pattern.
arXiv Detail & Related papers (2024-07-04T04:05:19Z) - Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models [19.683461002518147]
Test-Time Prototype Shifting (TPS) is a pioneering approach designed to adapt vision-language models to test datasets using unlabeled test inputs. TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods.
arXiv Detail & Related papers (2024-03-19T17:54:34Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent [17.14501241048221]
We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions.
The most notable feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth.
As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while beating SOTA on prediction accuracy.
arXiv Detail & Related papers (2023-11-30T07:25:24Z) - Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z) - VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers [45.42482446288144]
Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models to their vocabulary.
We investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input.
We create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph.
arXiv Detail & Related papers (2023-05-22T19:04:56Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.