Transformers with Competitive Ensembles of Independent Mechanisms
- URL: http://arxiv.org/abs/2103.00336v1
- Date: Sat, 27 Feb 2021 21:48:46 GMT
- Title: Transformers with Competitive Ensembles of Independent Mechanisms
- Authors: Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco
Ravanelli, Yoshua Bengio
- Abstract summary: We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms that exchange information only through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
- Score: 97.93090139318294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important development in deep learning from the earliest MLPs has been a
move towards architectures with structural inductive biases which enable the
model to keep distinct sources of information and routes of processing
well-separated. This structure is linked to the notion of independent
mechanisms from the causality literature, in which a mechanism is able to
retain the same processing as irrelevant aspects of the world are changed. For
example, convnets enable separation over positions, while attention-based
architectures (especially Transformers) learn which combination of positions to
process dynamically. In this work we explore a way in which the Transformer
architecture is deficient: it represents each position with a large monolithic
hidden representation and a single set of parameters which are applied over the
entire hidden representation. This potentially throws unrelated sources of
information together, and limits the Transformer's ability to capture
independent mechanisms. To address this, we propose Transformers with
Independent Mechanisms (TIM), a new Transformer layer which divides the hidden
representation and parameters into multiple mechanisms, which exchange
information only through attention. Additionally, we propose a competition mechanism
which encourages these mechanisms to specialize over time steps, and thus be
more independent. We study TIM on a large-scale BERT model, on the Image
Transformer, and on speech enhancement and find evidence for semantically
meaningful specialization as well as improved performance.
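As a concrete illustration of the layer described above, the following is a minimal PyTorch sketch of the idea, not the authors' reference implementation: the hidden state is split into mechanisms with independent attention and feed-forward parameters, a per-position softmax over mechanism scores implements the competition, and a separate attention step over the mechanism axis is the only channel through which mechanisms exchange information. All module and parameter names (`TIMLayer`, `n_mech`, the scoring heads) are illustrative assumptions, and details such as normalization placement are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TIMLayer(nn.Module):
    """Sketch of a Transformer layer with Independent Mechanisms (TIM).

    The d_model-dimensional hidden state at each position is split into
    n_mech chunks. Each chunk has its own self-attention and feed-forward
    parameters; a softmax over per-mechanism scores at each position gates
    the updates (competition); a final attention over the mechanism axis
    lets mechanisms exchange information.
    """

    def __init__(self, d_model: int = 512, n_mech: int = 4, n_heads: int = 4):
        super().__init__()
        assert d_model % n_mech == 0
        self.n_mech, self.d_mech = n_mech, d_model // n_mech
        # Independent parameters per mechanism.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(self.d_mech, n_heads, batch_first=True)
            for _ in range(n_mech))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_mech, 4 * self.d_mech),
                          nn.ReLU(),
                          nn.Linear(4 * self.d_mech, self.d_mech))
            for _ in range(n_mech))
        # One competition score per mechanism, per position.
        self.score = nn.ModuleList(
            nn.Linear(self.d_mech, 1) for _ in range(n_mech))
        # Attention across mechanisms: the only inter-mechanism channel.
        self.inter_attn = nn.MultiheadAttention(self.d_mech, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(self.d_mech)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> n_mech chunks of (batch, seq, d_mech).
        chunks = x.chunk(self.n_mech, dim=-1)
        # Competition: softmax across mechanisms at every position.
        scores = torch.cat(
            [self.score[m](chunks[m]) for m in range(self.n_mech)], dim=-1)
        w = F.softmax(scores, dim=-1)                 # (batch, seq, n_mech)
        outs = []
        for m in range(self.n_mech):
            h = self.norm(chunks[m])
            a, _ = self.attn[m](h, h, h)              # per-mechanism attention
            h = chunks[m] + w[..., m:m + 1] * a       # update gated by competition
            h = h + w[..., m:m + 1] * self.ffn[m](self.norm(h))
            outs.append(h)
        # Information exchange: attend over the mechanism axis per position.
        h = torch.stack(outs, dim=2)                  # (batch, seq, n_mech, d_mech)
        b, s, m, d = h.shape
        flat = h.reshape(b * s, m, d)
        ex, _ = self.inter_attn(flat, flat, flat)
        return (flat + ex).reshape(b, s, m * d)
```

A quick shape check under these assumptions: `TIMLayer()(torch.randn(2, 16, 512))` returns a `(2, 16, 512)` tensor, so the sketch can stand in where a standard Transformer block with `d_model=512` would sit.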
Related papers
- What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z)
- Disentangling and Integrating Relational and Sensory Information in Transformer Architectures [2.5322020135765464]
We distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects.
We propose an architectural extension of the Transformer framework, featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information.
arXiv Detail & Related papers (2024-05-26T23:52:51Z)
- Compete and Compose: Learning Independent Mechanisms for Modular World Models [57.94106862271727]
We present COMET, a modular world model which leverages reusable, independent mechanisms across different environments.
COMET is trained on multiple environments with varying dynamics via a two-step process: competition and composition.
We show that COMET is able to adapt to new environments with varying numbers of objects with improved sample efficiency compared to more conventional finetuning approaches.
arXiv Detail & Related papers (2024-04-23T15:03:37Z)
- Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks [19.574270595733502]
We analyze the mechanisms that emerge within a vanilla attention-only Transformer trained on a simple sequence modeling task.
We find that, as a result of training, the self-attention mechanism within the Transformer specializes in a way that mirrors the input and output gating mechanisms.
arXiv Detail & Related papers (2024-02-13T04:28:43Z)
- Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling [10.246977481606427]
We study the mechanisms through which different components of Transformer, such as the dot-product self-attention, affect its expressive power.
Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads.
arXiv Detail & Related papers (2024-02-01T11:43:13Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Properties from Mechanisms: An Equivariance Perspective on Identifiable Representation Learning [79.4957965474334]
A key goal of unsupervised representation learning is "inverting" a data generating process to recover its latent properties.
This paper asks, "Can we instead identify latent properties by leveraging knowledge of the mechanisms that govern their evolution?"
We provide a complete characterization of the sources of non-identifiability as we vary knowledge about a set of possible mechanisms.
arXiv Detail & Related papers (2021-10-29T14:04:08Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions by any Transformer-based architecture.
Our method is superior to existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)