Multi-branch Attentive Transformer
- URL: http://arxiv.org/abs/2006.10270v2
- Date: Sun, 26 Jul 2020 13:04:49 GMT
- Title: Multi-branch Attentive Transformer
- Authors: Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li,
Tie-Yan Liu
- Abstract summary: We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches and each branch is an independent multi-head attention layer.
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
- Score: 152.07840447196384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the multi-branch architecture is one of the key ingredients to the
success of computer vision tasks, it has not been well investigated in natural
language processing, especially sequence learning tasks. In this work, we
propose a simple yet effective variant of Transformer called multi-branch
attentive Transformer (briefly, MAT), where the attention layer is the average
of multiple branches and each branch is an independent multi-head attention
layer. We leverage two training techniques to regularize the training:
drop-branch, which randomly drops individual branches during training, and
proximal initialization, which uses a pre-trained Transformer model to
initialize multiple branches. Experiments on machine translation, code
generation and natural language understanding demonstrate that such a simple
variant of Transformer brings significant improvements. Our code is available
at https://github.com/HA-Transformer.
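For intuition, below is a minimal PyTorch sketch of the mechanism described in the abstract: an attention layer that averages several independent multi-head attention branches, with drop-branch applied during training and proximal initialization copying a pre-trained attention layer into every branch. This is not the authors' released code; the branch count, drop probability, and helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    """Average of several independent multi-head attention branches (sketch)."""

    def __init__(self, d_model=512, n_heads=8, n_branches=3, p_drop_branch=0.3):
        super().__init__()
        # Each branch is its own multi-head attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_branches)
        )
        self.p_drop_branch = p_drop_branch

    def forward(self, x):
        outputs = []
        for branch in self.branches:
            # Drop-branch: randomly skip individual branches while training.
            if self.training and torch.rand(1).item() < self.p_drop_branch:
                continue
            out, _ = branch(x, x, x)
            outputs.append(out)
        if not outputs:  # keep at least one branch active
            out, _ = self.branches[0](x, x, x)
            outputs.append(out)
        # The layer output is the average of the surviving branches.
        return torch.stack(outputs).mean(dim=0)


def proximal_init(layer: MultiBranchAttention, pretrained: nn.MultiheadAttention):
    """Proximal initialization: copy pre-trained attention weights into every branch."""
    for branch in layer.branches:
        branch.load_state_dict(pretrained.state_dict())
```

At evaluation time no branch is dropped, so the layer reduces to a plain average over all branches.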
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z)
- Semantic-visual Guided Transformer for Few-shot Class-incremental Learning [6.300141694311465]
We develop a semantic-visual guided Transformer (SV-T) to enhance the feature-extraction capacity of the pre-trained feature backbone on incremental classes.
Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone.
arXiv Detail & Related papers (2023-03-27T15:06:49Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Transformer over Pre-trained Transformer for Neural Text Segmentation with Enhanced Topic Coherence [6.73258176462356]
The proposed model consists of two components: bottom-level sentence encoders using pre-trained transformers, and an upper-level transformer-based segmentation model that operates on the sentence embeddings.
Our experiments show that Transformer$^2$ manages to surpass state-of-the-art text segmentation models in terms of a commonly used semantic coherence measure.
arXiv Detail & Related papers (2021-10-14T05:26:39Z)
- Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems [32.86421107987556]
We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
arXiv Detail & Related papers (2021-09-30T14:01:06Z)
- Dynamic Multi-Branch Layers for On-Device Neural Machine Translation [53.637479651600586]
We propose to improve the performance of on-device neural machine translation (NMT) systems with dynamic multi-branch layers.
Specifically, we design a layer-wise dynamic multi-branch network with only one branch activated during training and inference.
At almost the same computational cost, our method achieves improvements of up to 1.7 BLEU points on the WMT14 English-German translation task and 1.8 BLEU points on the WMT20 Chinese-English translation task.
arXiv Detail & Related papers (2021-05-14T07:32:53Z)
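For contrast with MAT's averaging over all branches, the sketch below illustrates the layer-wise "only one branch active" idea from the entry above, using feed-forward branches and a Gumbel-softmax hard gate; the gating scheme and branch type are illustrative assumptions rather than that paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicMultiBranchFFN(nn.Module):
    """Layer-wise dynamic multi-branch feed-forward block (hedged sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_branches)
        )
        self.gate = nn.Linear(d_model, n_branches)  # one score per branch

    def forward(self, x):  # x: (batch, seq_len, d_model)
        # Pool over the whole batch so a single branch is chosen per layer.
        logits = self.gate(x.mean(dim=(0, 1)))
        if self.training:
            # Straight-through hard selection keeps the gate trainable.
            onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        else:
            onehot = F.one_hot(logits.argmax(), logits.numel()).float()
        idx = int(onehot.argmax())
        # Only the selected branch is evaluated, so the per-layer cost stays
        # close to that of a single-branch layer.
        return onehot[idx] * self.branches[idx](x)
```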
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.