Capsule-Transformer for Neural Machine Translation
- URL: http://arxiv.org/abs/2004.14649v1
- Date: Thu, 30 Apr 2020 09:11:38 GMT
- Title: Capsule-Transformer for Neural Machine Translation
- Authors: Sufeng Duan, Juncheng Cao, Hai Zhao
- Abstract summary: Transformer hugely benefits from its key design, the multi-head self-attention network (SAN).
We propose the capsule-Transformer, which extends the linear transformation into a more general capsule routing algorithm.
Experimental results on widely used machine translation datasets show that the proposed capsule-Transformer significantly outperforms a strong Transformer baseline.
- Score: 73.84254045203222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer hugely benefits from its key design of the multi-head
self-attention network (SAN), which extracts information from various
perspectives through transforming the given input into different subspaces.
However, its simple linear transformation aggregation strategy may still
fail to fully capture deeper contextualized information. In this paper, we
therefore propose the capsule-Transformer, which extends the linear
transformation into a more general capsule routing algorithm by taking SAN
as a special case of a capsule network, so that the resulting
capsule-Transformer can obtain a better attention distribution
representation of the input sequence via information aggregation among
different heads and words.
Specifically, we view the groups of attention weights in SAN as low-layer
capsules. By applying the iterative capsule routing algorithm, these can be
further aggregated into high-layer capsules that contain deeper
contextualized information. Experimental results on widely used machine
translation datasets show that the proposed capsule-Transformer
significantly outperforms a strong Transformer baseline.
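As a rough illustration of the aggregation described in the abstract, the sketch below applies standard iterative (dynamic) routing in the style of Sabour et al. to route "low-layer capsules" (here standing in for per-head groups of attention weights) into high-layer capsules. The shapes, the number of routing iterations, and the routing variant are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Capsule nonlinearity: keeps the vector's direction while
    # compressing its length into [0, 1).
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (n_low, n_high, dim) prediction vectors from low-layer
    capsules (e.g. per-head attention weight groups) to high-layer
    capsules. Returns (n_high, dim) high-layer capsule outputs."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))  # routing logits
    for _ in range(n_iters):
        # Softmax over high-layer capsules for each low-layer capsule
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum, (n_high, dim)
        v = squash(s)                            # high-layer capsule outputs
        b = b + (u_hat * v[None]).sum(axis=-1)   # agreement update
    return v

# Toy usage: 8 heads as low-layer capsules routed into 4 high-layer capsules
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 4, 16))
v = dynamic_routing(u_hat)
print(v.shape)  # (4, 16)
```

The agreement update increases the routing logit between a low-layer and a high-layer capsule when their vectors point in similar directions, which is how information from different heads gets aggregated adaptively rather than by a fixed linear transformation.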
Related papers
- Deep multi-prototype capsule networks [0.3823356975862005]
Capsule networks are a type of neural network that identify image parts and form the instantiation parameters of a whole hierarchically.
This paper presents a multi-prototype architecture for guiding capsule networks to represent the variations in the image parts.
The experimental results on MNIST, SVHN, C-Cube, CEDAR, MCYT, and UTSig datasets reveal that the proposed model outperforms others regarding image classification accuracy.
arXiv Detail & Related papers (2024-04-23T18:37:37Z) - Why "classic" Transformers are shallow and how to make them go deep [4.520356456308492]
The key innovation in the Transformer is the self-attention (SA) mechanism, designed to capture contextual information.
However, extending the original Transformer design to models of greater depth has proven exceedingly challenging.
We propose a new strategy of surgically removing excessive similarity, in contrast to existing approaches that diminish the SA mechanism explicitly or implicitly.
arXiv Detail & Related papers (2023-12-11T07:49:16Z) - Inspecting Explainability of Transformer Models with Additional
Statistical Information [27.04589064942369]
Chefer et al. effectively visualize the Transformer on vision and multi-modal tasks by combining attention layers to show the importance of each image patch.
However, when applied to other Transformer variants such as the Swin Transformer, this method cannot focus on the predicted object.
By considering the statistics of tokens in layer normalization layers, our method interprets both the Swin Transformer and ViT well.
arXiv Detail & Related papers (2023-11-19T17:22:50Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks and quantitatively evaluate them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves highly competitive performance against the original Transformer networks on vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - DS-TransUNet:Dual Swin Transformer U-Net for Medical Image Segmentation [18.755217252996754]
We propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet)
Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales.
As the core component for our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism.
arXiv Detail & Related papers (2021-06-12T08:37:17Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales and have shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into a Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Subspace Capsule Network [85.69796543499021]
SubSpace Capsule Network (SCN) exploits the idea of capsule networks to model possible variations in the appearance or implicitly defined properties of an entity.
SCN can be applied to both discriminative and generative models without incurring computational overhead compared to CNN during test time.
arXiv Detail & Related papers (2020-02-07T17:51:56Z) - Examining the Benefits of Capsule Neural Networks [9.658250977094562]
Capsule networks are a newly developed class of neural networks that potentially address some of the deficiencies of traditional convolutional neural networks.
By replacing the standard scalar activations with vectors, capsule networks aim to be the next great development for computer vision applications.
arXiv Detail & Related papers (2020-01-29T17:18:43Z)
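The "scalar activations replaced with vectors" idea in the last entry can be sketched with the standard squash nonlinearity (an assumption: Sabour et al.'s 2017 formulation, which is not taken from this listing). It maps a capsule's vector output so that its length behaves like a probability while its direction encodes the entity's properties:

```python
import numpy as np

def squash(s, eps=1e-8):
    # Squash nonlinearity: preserves the vector's orientation while
    # compressing its length into [0, 1), so length reads as confidence.
    sq = np.sum(s ** 2)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

long_vec = squash(np.array([3.0, 4.0]))   # length 5 -> ~0.96
short_vec = squash(np.array([0.3, 0.4]))  # length 0.5 -> 0.2
print(np.linalg.norm(long_vec), np.linalg.norm(short_vec))
```

Long input vectors are mapped to lengths near 1 and short ones are shrunk toward 0, which is what lets a capsule's vector length stand in for the scalar activation of a conventional neuron.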
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.