Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to
Self-attention
- URL: http://arxiv.org/abs/2207.13354v1
- Date: Wed, 27 Jul 2022 08:20:00 GMT
- Title: Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to
Self-attention
- Authors: Mengsay Loem, Sho Takase, Masahiro Kaneko and Naoaki Okazaki
- Abstract summary: We show that replacing self-attention in Transformer with multi-head neural $n$-gram can achieve performance comparable to or better than that of Transformer.
From various analyses of our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive performance of Transformer has been attributed to self-attention,
where dependencies across the entire input sequence are considered at every
position. In this work, we reform the neural $n$-gram model, which focuses only on
several surrounding representations of each position, with the multi-head
mechanism as in Vaswani et al. (2017). Through experiments on
sequence-to-sequence tasks, we show that replacing self-attention in
Transformer with multi-head neural $n$-gram can achieve performance comparable
to or better than that of Transformer. From various analyses of our proposed method, we
find that multi-head neural $n$-gram is complementary to self-attention, and
their combination can further improve the performance of the vanilla Transformer.
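As a concrete illustration of the idea in the abstract, below is a minimal PyTorch sketch of a multi-head neural $n$-gram layer: every head mixes only the current representation and its $n-1$ predecessors with learned per-offset weights, instead of attending over the whole sequence. The class name, the causal window, and the softmax-weighted mixing are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a multi-head neural n-gram layer (assumptions: per-head
# learned mixing weights over a causal window of n tokens; this is an
# illustration, not the authors' released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNeuralNgram(nn.Module):
    def __init__(self, d_model: int, num_heads: int, ngram: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.n = num_heads, ngram
        self.d_head = d_model // num_heads
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One learned weight per head and per relative offset in the window.
        self.mix = nn.Parameter(torch.zeros(num_heads, ngram))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        v = self.in_proj(x).view(b, t, self.h, self.d_head)
        # Left-pad the sequence so position i sees tokens i-n+1 .. i.
        v = F.pad(v, (0, 0, 0, 0, self.n - 1, 0))
        # Stack the n shifted views: (batch, seq_len, n, heads, d_head).
        windows = torch.stack([v[:, k:k + t] for k in range(self.n)], dim=2)
        w = torch.softmax(self.mix, dim=-1)            # (heads, n)
        out = torch.einsum('btnhd,hn->bthd', windows, w)
        return self.out_proj(out.reshape(b, t, -1))
```

In a Transformer block, a layer like this would stand in for the self-attention sub-layer, with the usual residual connection and layer normalization kept around it.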
Related papers
- Multiset Transformer: Advancing Representation Learning in Persistence Diagrams [11.512742322405906]
Multiset Transformer is a neural network that utilizes attention mechanisms specifically designed for multisets as inputs.
The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers.
Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.
arXiv Detail & Related papers (2024-11-22T01:38:47Z)
- Sampled Transformer for Point Sets [80.66097006145999]
The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z)
- Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems [32.86421107987556]
We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
arXiv Detail & Related papers (2021-09-30T14:01:06Z)
- UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [6.646135062704341]
Transformer architecture has emerged to be successful in a number of natural language processing tasks.
We present UTNet, a powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation.
arXiv Detail & Related papers (2021-07-02T00:56:27Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic Inductive Bias from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network (a rough sketch of this parallel layout appears after this list).
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, instead of the quadratic complexity of standard attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches, each of which is an independent multi-head attention layer (see the averaging sketch after this list).
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections (an illustrative local-window attention mask appears after this list).
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
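For the ViTAE entry above, the described layout, a convolution block running in parallel to multi-head self-attention with the two streams fused before the feed-forward network, could be sketched as follows. The 1-D convolution over the token sequence, the additive fusion, and the pre-norm placement are assumptions for illustration, not details taken from the paper.

```python
# Rough sketch of a "parallel convolution + self-attention" block as described
# in the ViTAE summary above; additive fusion and 1-D convolution are assumptions.
import torch
import torch.nn as nn


class ParallelConvAttnBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, ffn_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)
        x = x + attn_out + conv_out          # fuse the two parallel branches
        return x + self.ffn(self.norm2(x))   # feed-forward on the fused features
```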
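The Multi-branch Attentive Transformer entry states that the attention layer is the average of several independent multi-head attention branches. A minimal sketch of that averaging is shown below; the branch count and the absence of any drop-branch regularization are assumptions.

```python
# Minimal sketch of a multi-branch attention layer: the output is the average
# of several independent multi-head attention branches, per the summary above.
import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_branches)
        )

    def forward(self, query, key, value):
        # Run each independent multi-head attention branch, then average.
        outs = [branch(query, key, value)[0] for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)
```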
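For the $O(n)$ sparse-Transformer entry, one simple pattern consistent with the summary is attention restricted to a fixed-width local window, which keeps the number of connections per layer linear in the sequence length. The mask below is an illustrative example only, not the specific sparsity patterns analyzed in that paper.

```python
# Illustrative local (banded) attention mask: each position attends to at most
# `window` neighbors on either side, so connections per layer grow as O(n)
# rather than O(n^2). This is one example pattern, not the cited paper's.
import torch


def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # True marks positions that are *blocked* (the boolean attn_mask convention
    # used by torch.nn.MultiheadAttention).
    return (idx[None, :] - idx[:, None]).abs() > window


mask = local_attention_mask(seq_len=8, window=2)
# Pass `mask` as attn_mask to nn.MultiheadAttention to sparsify the layer.
```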
This list is automatically generated from the titles and abstracts of the papers on this site.