Transformers with Competitive Ensembles of Independent Mechanisms
- URL: http://arxiv.org/abs/2103.00336v1
- Date: Sat, 27 Feb 2021 21:48:46 GMT
- Title: Transformers with Competitive Ensembles of Independent Mechanisms
- Authors: Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco
Ravanelli, Yoshua Bengio
- Abstract summary: We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms that exchange information only through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
- Score: 97.93090139318294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important development in deep learning from the earliest MLPs has been a
move towards architectures with structural inductive biases which enable the
model to keep distinct sources of information and routes of processing
well-separated. This structure is linked to the notion of independent
mechanisms from the causality literature, in which a mechanism is able to
retain the same processing as irrelevant aspects of the world are changed. For
example, convnets enable separation over positions, while attention-based
architectures (especially Transformers) learn which combination of positions to
process dynamically. In this work we explore a way in which the Transformer
architecture is deficient: it represents each position with a large monolithic
hidden representation and a single set of parameters which are applied over the
entire hidden representation. This potentially throws unrelated sources of
information together, and limits the Transformer's ability to capture
independent mechanisms. To address this, we propose Transformers with
Independent Mechanisms (TIM), a new Transformer layer which divides the hidden
representation and parameters into multiple mechanisms, which exchange
information only through attention. Additionally, we propose a competition mechanism
which encourages these mechanisms to specialize over time steps, and thus be
more independent. We study TIM on a large-scale BERT model, on the Image
Transformer, and on speech enhancement and find evidence for semantically
meaningful specialization as well as improved performance.
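As a concrete illustration of the layer described above, the following is a minimal PyTorch sketch of the idea, not the authors' reference implementation: the hidden state is split into mechanisms with independent attention and feed-forward parameters, a per-position softmax over mechanism scores implements the competition, and a separate attention step over the mechanism axis is the only channel through which mechanisms exchange information. All module and parameter names (`TIMLayer`, `n_mech`, the scoring heads) are illustrative assumptions, and details such as normalization placement are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TIMLayer(nn.Module):
    """Sketch of a Transformer layer with Independent Mechanisms (TIM).

    The d_model-dimensional hidden state at each position is split into
    n_mech chunks. Each chunk has its own self-attention and feed-forward
    parameters; a softmax over per-mechanism scores at each position gates
    the updates (competition); a final attention over the mechanism axis
    lets mechanisms exchange information.
    """

    def __init__(self, d_model: int = 512, n_mech: int = 4, n_heads: int = 4):
        super().__init__()
        assert d_model % n_mech == 0
        self.n_mech, self.d_mech = n_mech, d_model // n_mech
        # Independent parameters per mechanism.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(self.d_mech, n_heads, batch_first=True)
            for _ in range(n_mech))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_mech, 4 * self.d_mech),
                          nn.ReLU(),
                          nn.Linear(4 * self.d_mech, self.d_mech))
            for _ in range(n_mech))
        # One competition score per mechanism, per position.
        self.score = nn.ModuleList(
            nn.Linear(self.d_mech, 1) for _ in range(n_mech))
        # Attention across mechanisms: the only inter-mechanism channel.
        self.inter_attn = nn.MultiheadAttention(self.d_mech, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(self.d_mech)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> n_mech chunks of (batch, seq, d_mech).
        chunks = x.chunk(self.n_mech, dim=-1)
        # Competition: softmax across mechanisms at every position.
        scores = torch.cat(
            [self.score[m](chunks[m]) for m in range(self.n_mech)], dim=-1)
        w = F.softmax(scores, dim=-1)                 # (batch, seq, n_mech)
        outs = []
        for m in range(self.n_mech):
            h = self.norm(chunks[m])
            a, _ = self.attn[m](h, h, h)              # per-mechanism attention
            h = chunks[m] + w[..., m:m + 1] * a       # update gated by competition
            h = h + w[..., m:m + 1] * self.ffn[m](self.norm(h))
            outs.append(h)
        # Information exchange: attend over the mechanism axis per position.
        h = torch.stack(outs, dim=2)                  # (batch, seq, n_mech, d_mech)
        b, s, m, d = h.shape
        flat = h.reshape(b * s, m, d)
        ex, _ = self.inter_attn(flat, flat, flat)
        return (flat + ex).reshape(b, s, m * d)
```

A quick shape check under these assumptions: `TIMLayer()(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor, so the sketch can stand in where a standard Transformer block with `d_model=512` would sit.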
Related papers
- What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z)
- Disentangling and Integrating Relational and Sensory Information in Transformer Architectures [2.5322020135765464]
We distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects.
We propose an architectural extension of the Transformer framework, featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information.
arXiv Detail & Related papers (2024-05-26T23:52:51Z)
- Compete and Compose: Learning Independent Mechanisms for Modular World Models [57.94106862271727]
We present COMET, a modular world model which leverages reusable, independent mechanisms across different environments.
COMET is trained on multiple environments with varying dynamics via a two-step process: competition and composition.
We show that COMET is able to adapt to new environments with varying numbers of objects with improved sample efficiency compared to more conventional finetuning approaches.
arXiv Detail & Related papers (2024-04-23T15:03:37Z)
- Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks [19.574270595733502]
We analyze the mechanisms that emerge within a vanilla attention-only Transformer trained on a simple sequence modeling task.
We find that, as a result of training, the self-attention mechanism within the Transformer specializes in a way that mirrors the input and output gating mechanisms.
arXiv Detail & Related papers (2024-02-13T04:28:43Z)
- Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling [10.246977481606427]
We study the mechanisms through which different components of Transformer, such as the dot-product self-attention, affect its expressive power.
Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads.
arXiv Detail & Related papers (2024-02-01T11:43:13Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning [79.4957965474334]
A key goal of unsupervised representation learning is "inverting" a data generating process to recover its latent properties.
This paper asks, "Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?"
We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms.
arXiv Detail & Related papers (2021-10-29T14:04:08Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions by any Transformer-based architecture.
Our method is superior to existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)