Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to
Self-attention
- URL: http://arxiv.org/abs/2207.13354v1
- Date: Wed, 27 Jul 2022 08:20:00 GMT
- Title: Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to
Self-attention
- Authors: Mengsay Loem, Sho Takase, Masahiro Kaneko and Naoaki Okazaki
- Abstract summary: We show that replacing self-attention in Transformer with multi-head neural $n$-gram can achieve performance comparable to or better than that of Transformer.
From various analyses of our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive performance of Transformer has been attributed to self-attention,
where dependencies across the entire input sequence are considered at every
position. In this work, we reform the neural $n$-gram model, which focuses only on
several surrounding representations of each position, with the multi-head
mechanism as in Vaswani et al. (2017). Through experiments on
sequence-to-sequence tasks, we show that replacing self-attention in
Transformer with multi-head neural $n$-gram can achieve performance comparable
to or better than that of Transformer. From various analyses of our proposed method, we
find that multi-head neural $n$-gram is complementary to self-attention, and
their combination can further improve the performance of the vanilla Transformer.
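As a concrete illustration of the idea in the abstract, below is a minimal PyTorch sketch of a multi-head neural $n$-gram layer: every head mixes only the current representation and its $n-1$ predecessors with learned per-offset weights, instead of attending over the whole sequence. The class name, the causal window, and the softmax-weighted mixing are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a multi-head neural n-gram layer (assumptions: per-head
# learned mixing weights over a causal window of n tokens; this is an
# illustration, not the authors' released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNeuralNgram(nn.Module):
    def __init__(self, d_model: int, num_heads: int, ngram: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.n = num_heads, ngram
        self.d_head = d_model // num_heads
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # One learned weight per head and per relative offset in the window.
        self.mix = nn.Parameter(torch.zeros(num_heads, ngram))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        v = self.in_proj(x).view(b, t, self.h, self.d_head)
        # Left-pad the sequence so position i sees tokens i-n+1 .. i.
        v = F.pad(v, (0, 0, 0, 0, self.n - 1, 0))
        # Stack the n shifted views: (batch, seq_len, n, heads, d_head).
        windows = torch.stack([v[:, k:k + t] for k in range(self.n)], dim=2)
        w = torch.softmax(self.mix, dim=-1)            # (heads, n)
        out = torch.einsum('btnhd,hn->bthd', windows, w)
        return self.out_proj(out.reshape(b, t, -1))
```

In a Transformer block, a layer like this would stand in for the self-attention sub-layer, with the usual residual connection and layer normalization kept around it.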
Related papers
- Multiset Transformer: Advancing Representation Learning in Persistence Diagrams [11.512742322405906]
Multiset Transformer is a neural network that utilizes attention mechanisms specifically designed for multisets as inputs.
The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers.
Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.
arXiv Detail & Related papers (2024-11-22T01:38:47Z)
- Sampled Transformer for Point Sets [80.66097006145999]
The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z)
- Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems [32.86421107987556]
We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
arXiv Detail & Related papers (2021-09-30T14:01:06Z)
- UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [6.646135062704341]
Transformer architecture has emerged to be successful in a number of natural language processing tasks.
We present UTNet, a powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation.
arXiv Detail & Related papers (2021-07-02T00:56:27Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic Inductive Bias from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network (a rough sketch of this parallel layout appears after this list).
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, instead of the quadratic complexity of standard attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches, each of which is an independent multi-head attention layer (see the averaging sketch after this list).
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections (an illustrative local-window attention mask appears after this list).
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
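For the ViTAE entry above, the described layout, a convolution block running in parallel to multi-head self-attention with the two streams fused before the feed-forward network, could be sketched as follows. The 1-D convolution over the token sequence, the additive fusion, and the pre-norm placement are assumptions for illustration, not details taken from the paper.

```python
# Rough sketch of a "parallel convolution + self-attention" block as described
# in the ViTAE summary above; additive fusion and 1-D convolution are assumptions.
import torch
import torch.nn as nn


class ParallelConvAttnBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, ffn_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)
        x = x + attn_out + conv_out          # fuse the two parallel branches
        return x + self.ffn(self.norm2(x))   # feed-forward on the fused features
```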
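The Multi-branch Attentive Transformer entry states that the attention layer is the average of several independent multi-head attention branches. A minimal sketch of that averaging is shown below; the branch count and the absence of any drop-branch regularization are assumptions.

```python
# Minimal sketch of a multi-branch attention layer: the output is the average
# of several independent multi-head attention branches, per the summary above.
import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_branches)
        )

    def forward(self, query, key, value):
        # Run each independent multi-head attention branch, then average.
        outs = [branch(query, key, value)[0] for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)
```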
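For the $O(n)$ sparse-Transformer entry, one simple pattern consistent with the summary is attention restricted to a fixed-width local window, which keeps the number of connections per layer linear in the sequence length. The mask below is an illustrative example only, not the specific sparsity patterns analyzed in that paper.

```python
# Illustrative local (banded) attention mask: each position attends to at most
# `window` neighbors on either side, so connections per layer grow as O(n)
# rather than O(n^2). This is one example pattern, not the cited paper's.
import torch


def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # True marks positions that are *blocked* (the boolean attn_mask convention
    # used by torch.nn.MultiheadAttention).
    return (idx[None, :] - idx[:, None]).abs() > window


mask = local_attention_mask(seq_len=8, window=2)
# Pass `mask` as attn_mask to nn.MultiheadAttention to sparsify the layer.
```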
This list is automatically generated from the titles and abstracts of the papers on this site.