Related papers: Extracting Rule-based Descriptions of Attention Features in Transformers

Extracting Rule-based Descriptions of Attention Features in Transformers

URL: http://arxiv.org/abs/2510.18148v1
Date: Mon, 20 Oct 2025 22:52:40 GMT
Title: Extracting Rule-based Descriptions of Attention Features in Transformers
Authors: Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen,
Abstract summary: We study rule-based descriptions of SAE features trained on the outputs of attention layers.<n>We find that a majority of features may be described well with around 100 skip-gram rules.<n>This paper lays the groundwork for future research into rule-based descriptions of features.
Score: 68.33953232728204
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form "[Canadian city]... speaks --> English", (2) absence rules of the form "[Montreal]... speaks -/-> English," and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.

Related papers

Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs [0.8135825089247968]
We explore the potential of large language models to generate natural language explanations for logical rules.<n>Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237.<n>We examine various prompting strategies, including zero- and few-shot prompting, including variable entity types, and chain-of-thought reasoning.
arXiv Detail & Related papers (2025-07-31T17:24:04Z)
Neuro-Symbolic Temporal Point Processes [13.72758658973969]
We introduce a neural-symbolic rule induction framework within the temporal point process model. The negative log-likelihood is the loss that guides the learning, where the explanatory logic rules and their weights are learned end-to-end. Our approach showcases notable efficiency and accuracy across synthetic and real datasets.
arXiv Detail & Related papers (2024-06-06T09:52:56Z)
Can Language Models Explain Their Own Classification Behavior? [1.8177391253202122]
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
arXiv Detail & Related papers (2024-05-13T02:31:08Z)
Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks [6.390468088226495]
We propose a new method to extract and explore significant fine-grained grammar patterns from treebanks. We extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
arXiv Detail & Related papers (2024-03-26T09:39:53Z)
Rule-driven News Captioning [33.145889362997316]
News captioning task aims to generate sentences by describing named entities or concrete events for an image with its news article. Existing methods have achieved remarkable results by relying on the large-scale pre-trained models. We propose the rule-driven news captioning method, which can generate image descriptions following designated rule signal.
arXiv Detail & Related papers (2024-03-08T07:06:43Z)
ChatRule: Mining Logical Rules with Large Language Models for Knowledge Graph Reasoning [107.61997887260056]
We propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs. Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs. To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs.
arXiv Detail & Related papers (2023-09-04T11:38:02Z)
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z)
RulE: Knowledge Graph Reasoning with Rule Embedding [69.31451649090661]
We propose a principled framework called textbfRulE (stands for Rule Embedding) to leverage logical rules to enhance KG reasoning. RulE learns rule embeddings from existing triplets and first-order rules by jointly representing textbfentities, textbfrelations and textbflogical rules in a unified embedding space. Results on multiple benchmarks reveal that our model outperforms the majority of existing embedding-based and rule-based approaches.
arXiv Detail & Related papers (2022-10-24T06:47:13Z)
An Exploration And Validation of Visual Factors in Understanding Classification Rule Sets [21.659381756612866]
Rule sets are often used in Machine Learning (ML) as a way to communicate the model logic in settings where transparency and intelligibility are necessary. Surprisingly, to date there has been limited work on exploring visual alternatives for presenting rules. This work can help practitioners employ more effective solutions when using rules as a communication strategy to understand ML models.
arXiv Detail & Related papers (2021-09-19T16:33:16Z)
Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models. Our method is based on projecting model representation to a latent space. Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z)
Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.