Multi-Pass Transformer for Machine Translation
- URL: http://arxiv.org/abs/2009.11382v1
- Date: Wed, 23 Sep 2020 21:22:15 GMT
- Title: Multi-Pass Transformer for Machine Translation
- Authors: Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux
- Abstract summary: We consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.
A Base Transformer equipped with MPT can surpass the performance of a Large Transformer on the challenging En-De and En-Fr machine translation datasets.
In the hard connection case, the optimal connection pattern found for En-De also leads to improved performance for En-Fr.
- Score: 51.867982400693194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In contrast with previous approaches where information flows only towards
deeper layers of a stack, we consider a multi-pass transformer (MPT)
architecture in which earlier layers are allowed to process information in
light of the output of later layers. To maintain a directed acyclic graph
structure, the encoder stack of a transformer is repeated along a new
multi-pass dimension, keeping the parameters tied, and information is allowed
to proceed unidirectionally both towards deeper layers within an encoder stack
and towards any layer of subsequent stacks. We consider both soft (i.e.,
continuous) and hard (i.e., discrete) connections between parallel encoder
stacks, relying on a neural architecture search to find the best connection
pattern in the hard case. We perform an extensive ablation study of the
proposed MPT architecture and compare it with other state-of-the-art
transformer architectures. Surprisingly, Base Transformer equipped with MPT can
surpass the performance of Large Transformer on the challenging machine
translation En-De and En-Fr datasets. In the hard connection case, the optimal
connection pattern found for En-De also leads to improved performance for
En-Fr.
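The soft-connection variant can be made concrete with a small sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the authors' implementation: it reuses one parameter-tied encoder stack across two passes and, in the second pass, feeds each layer a softmax-weighted mixture of its own input and every layer output from the first pass. The class name SoftMultiPassEncoder, the mixture parameterization, and all hyperparameters are illustrative; the hard-connection patterns found by neural architecture search are not reproduced here.
```python
# Minimal sketch of a two-pass encoder with soft connections (an illustrative
# assumption-based reading of the MPT idea, not the paper's released code).
import torch
import torch.nn as nn


class SoftMultiPassEncoder(nn.Module):
    """Runs the same parameter-tied encoder layers for several passes.

    In every pass after the first, each layer's input is a softmax-weighted
    mixture of its usual input and all layer outputs of the previous pass,
    so "earlier" layers can use information computed by "later" layers while
    the overall computation graph stays acyclic.
    """

    def __init__(self, d_model=512, nhead=8, num_layers=6, num_passes=2):
        super().__init__()
        # One set of layers, reused in every pass (tied parameters).
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)]
        )
        self.num_passes = num_passes
        # Soft-connection logits: for each later pass and each layer,
        # weights over [own input] + [each layer output of the previous pass].
        self.mix_logits = nn.Parameter(
            torch.zeros(num_passes - 1, num_layers, num_layers + 1)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        prev_outputs = None                    # layer outputs of previous pass
        for p in range(self.num_passes):
            h, outputs = x, []
            for i, layer in enumerate(self.layers):
                if p > 0:
                    # Soft connection: mix the current input with all
                    # previous-pass layer outputs via continuous weights.
                    w = torch.softmax(self.mix_logits[p - 1, i], dim=-1)
                    candidates = torch.stack([h] + prev_outputs, dim=0)
                    h = (w.view(-1, 1, 1, 1) * candidates).sum(dim=0)
                h = layer(h)
                outputs.append(h)
            prev_outputs = outputs
        return h                               # last layer of the last pass
```
Under these assumptions, `SoftMultiPassEncoder()(torch.randn(2, 10, 512))` returns a (2, 10, 512) tensor; replacing the softmax mixture with a one-hot selection over the same candidates would loosely correspond to the hard (discrete) connection case that the paper optimizes by architecture search.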
Related papers
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Pyramid Hierarchical Transformer for Hyperspectral Image Classification [1.9427851979929982]
We propose a pyramid-based hierarchical transformer (PyFormer).
This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels.
Results underscore the superiority of the proposed method over traditional approaches.
arXiv Detail & Related papers (2024-04-23T11:41:19Z)
- DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging [34.643717080240584]
We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size.
Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations.
Experiments demonstrate that DenseFormer is more data-efficient, reaching the same perplexity as much deeper transformer models.
arXiv Detail & Related papers (2024-02-04T21:44:09Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers [71.32827362323205]
We propose a new class of linear Transformers called Learner-Transformers (Learners).
They incorporate a wide range of relative positional encoding mechanisms (RPEs).
These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces.
arXiv Detail & Related papers (2023-02-03T18:57:17Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Rewiring the Transformer with Depth-Wise LSTMs [55.50278212605607]
We present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers.
Experiments with the 6-layer Transformer show significant BLEU improvements on both the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task.
arXiv Detail & Related papers (2020-07-13T09:19:34Z)