Multi-branch Attentive Transformer
- URL: http://arxiv.org/abs/2006.10270v2
- Date: Sun, 26 Jul 2020 13:04:49 GMT
- Title: Multi-branch Attentive Transformer
- Authors: Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li,
Tie-Yan Liu
- Abstract summary: We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches and each branch is an independent multi-head attention layer.
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
- Score: 152.07840447196384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the multi-branch architecture is one of the key ingredients to the
success of computer vision tasks, it has not been well investigated in natural
language processing, especially sequence learning tasks. In this work, we
propose a simple yet effective variant of Transformer called multi-branch
attentive Transformer (briefly, MAT), where the attention layer is the average
of multiple branches and each branch is an independent multi-head attention
layer. We leverage two training techniques to regularize the training:
drop-branch, which randomly drops individual branches during training, and
proximal initialization, which uses a pre-trained Transformer model to
initialize multiple branches. Experiments on machine translation, code
generation and natural language understanding demonstrate that such a simple
variant of Transformer brings significant improvements. Our code is available
at https://github.com/HA-Transformer.
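For intuition, below is a minimal PyTorch sketch of the mechanism described in the abstract: an attention layer that averages several independent multi-head attention branches, with drop-branch applied during training and proximal initialization copying a pre-trained attention layer into every branch. This is not the authors' released code; the branch count, drop probability, and helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    """Average of several independent multi-head attention branches (sketch)."""

    def __init__(self, d_model=512, n_heads=8, n_branches=3, p_drop_branch=0.3):
        super().__init__()
        # Each branch is its own multi-head attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_branches)
        )
        self.p_drop_branch = p_drop_branch

    def forward(self, x):
        outputs = []
        for branch in self.branches:
            # Drop-branch: randomly skip individual branches while training.
            if self.training and torch.rand(1).item() < self.p_drop_branch:
                continue
            out, _ = branch(x, x, x)
            outputs.append(out)
        if not outputs:  # keep at least one branch active
            out, _ = self.branches[0](x, x, x)
            outputs.append(out)
        # The layer output is the average of the surviving branches.
        return torch.stack(outputs).mean(dim=0)


def proximal_init(layer: MultiBranchAttention, pretrained: nn.MultiheadAttention):
    """Proximal initialization: copy pre-trained attention weights into every branch."""
    for branch in layer.branches:
        branch.load_state_dict(pretrained.state_dict())
```

At evaluation time no branch is dropped, so the layer reduces to a plain average over all branches.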
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z)
- Semantic-visual Guided Transformer for Few-shot Class-incremental Learning [6.300141694311465]
We develop a semantic-visual guided Transformer (SV-T) to enhance the feature-extraction capacity of the pre-trained feature backbone on incremental classes.
Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone.
arXiv Detail & Related papers (2023-03-27T15:06:49Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Transformer over Pre-trained Transformer for Neural Text Segmentation with Enhanced Topic Coherence [6.73258176462356]
The proposed model consists of two components: bottom-level sentence encoders using pre-trained transformers, and an upper-level transformer-based segmentation model that operates on the sentence embeddings.
Our experiments show that Transformer$^2$ manages to surpass state-of-the-art text segmentation models in terms of a commonly used semantic coherence measure.
arXiv Detail & Related papers (2021-10-14T05:26:39Z)
- Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems [32.86421107987556]
We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
arXiv Detail & Related papers (2021-09-30T14:01:06Z)
- Dynamic Multi-Branch Layers for On-Device Neural Machine Translation [53.637479651600586]
We propose to improve the performance of on-device neural machine translation (NMT) systems with dynamic multi-branch layers.
Specifically, we design a layer-wise dynamic multi-branch network with only one branch activated during training and inference.
At almost the same computational cost, our method achieves improvements of up to 1.7 BLEU points on the WMT14 English-German translation task and 1.8 BLEU points on the WMT20 Chinese-English translation task.
arXiv Detail & Related papers (2021-05-14T07:32:53Z)
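For contrast with MAT's averaging over all branches, the sketch below illustrates the layer-wise "only one branch active" idea from the entry above, using feed-forward branches and a Gumbel-softmax hard gate; the gating scheme and branch type are illustrative assumptions rather than that paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicMultiBranchFFN(nn.Module):
    """Layer-wise dynamic multi-branch feed-forward block (hedged sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_branches)
        )
        self.gate = nn.Linear(d_model, n_branches)  # one score per branch

    def forward(self, x):  # x: (batch, seq_len, d_model)
        # Pool over the whole batch so a single branch is chosen per layer.
        logits = self.gate(x.mean(dim=(0, 1)))
        if self.training:
            # Straight-through hard selection keeps the gate trainable.
            onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        else:
            onehot = F.one_hot(logits.argmax(), logits.numel()).float()
        idx = int(onehot.argmax())
        # Only the selected branch is evaluated, so the per-layer cost stays
        # close to that of a single-branch layer.
        return onehot[idx] * self.branches[idx](x)
```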
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.