Related papers: Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Related papers

Selective Induction Heads: How Transformers Select Causal Structures In Context [50.09964990342878]
We introduce a novel framework that showcases transformers' ability to handle causal structures.<n>Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed.<n>This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
arXiv Detail & Related papers (2025-09-09T23:13:41Z)
Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks.<n>We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability.<n>Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z)
Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning [30.781578037476347]
We introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs) Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets.
arXiv Detail & Related papers (2025-03-03T09:12:14Z)
Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment. We propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning. At its core, the attention block differs in form and functionality from most other architectural components in deep learning. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z)
Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models. SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
Disentangling and Integrating Relational and Sensory Information in Transformer Architectures [2.5322020135765464]
We distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. We propose an architectural extension of the Transformer framework, featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information.
arXiv Detail & Related papers (2024-05-26T23:52:51Z)
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task [14.921790126851008]
We present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence.
arXiv Detail & Related papers (2024-02-19T08:04:25Z)
How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations. We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
Transformers are Universal Predictors [21.92580010179886]
We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training.
arXiv Detail & Related papers (2023-07-15T16:19:37Z)
Approximation Rate of the Transformer Architecture for Sequence Modeling [18.166959969957315]
We consider a class of non-linear relationships and identify a novel notion of complexity measures to establish an explicit Jackson-type approximation rate estimate for the Transformer. This rate reveals the structural properties of the Transformer and suggests the types of sequential relationships it is best suited for approximating.
arXiv Detail & Related papers (2023-05-29T10:56:36Z)
XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.