Layer-Wise Multi-View Learning for Neural Machine Translation
- URL: http://arxiv.org/abs/2011.01482v1
- Date: Tue, 3 Nov 2020 05:06:37 GMT
- Title: Layer-Wise Multi-View Learning for Neural Machine Translation
- Authors: Qiang Wang, Changliang Li, Yue Zhang, Tong Xiao, Jingbo Zhu
- Abstract summary: Traditional neural machine translation is limited to the topmost encoder layer's context representation.
We propose layer-wise multi-view learning to solve this problem.
Our approach yields stable improvements over multiple strong baselines.
- Score: 45.679212203943194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional neural machine translation is limited to the topmost encoder
layer's context representation and cannot directly perceive the lower encoder
layers. Existing solutions usually rely on the adjustment of network
architecture, making the calculation more complicated or introducing additional
structural restrictions. In this work, we propose layer-wise multi-view
learning to solve this problem, circumventing the necessity to change the model
structure. We regard each encoder layer's off-the-shelf output, a by-product in
layer-by-layer encoding, as the redundant view for the input sentence. In this
way, in addition to the topmost encoder layer (referred to as the primary
view), we also incorporate an intermediate encoder layer as the auxiliary view.
We feed the two views to a partially shared decoder to maintain independent
prediction. Consistency regularization based on KL divergence is used to
encourage the two views to learn from each other. Extensive experimental
results on five translation tasks show that our approach yields stable
improvements over multiple strong baselines. As another bonus, our method is
agnostic to network architectures and can maintain the same inference speed as
the original model.
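To make the objective concrete, below is a minimal PyTorch sketch of the two-view training loss described above. It assumes the per-layer encoder outputs are already collected in a list and that the partially shared decoder is a callable returning logits; the function name, shapes, and the default `aux_layer_id` are illustrative, not taken from the paper's released code.

```python
import torch.nn.functional as F

def multi_view_loss(views, decoder, tgt, aux_layer_id=3, alpha=1.0):
    """Two-view training loss (sketch).

    views:   list of per-layer encoder outputs [B, S, H]; views[-1] is the
             topmost layer (primary view), views[aux_layer_id] is an
             intermediate layer (auxiliary view).
    decoder: callable(tgt, memory) -> logits [B, T, V]; partially shared
             between the two views in the paper.
    tgt:     gold target token ids [B, T].
    """
    primary, auxiliary = views[-1], views[aux_layer_id]

    # Each view is decoded independently by the partially shared decoder.
    logits_p = decoder(tgt, primary)
    logits_a = decoder(tgt, auxiliary)

    # Independent cross-entropy terms keep both views predictive on their own.
    vocab = logits_p.size(-1)
    ce = F.cross_entropy(logits_p.reshape(-1, vocab), tgt.reshape(-1)) \
       + F.cross_entropy(logits_a.reshape(-1, vocab), tgt.reshape(-1))

    # Consistency regularization: symmetric KL divergence between the two
    # predictive distributions, so the views learn from each other.
    log_p = F.log_softmax(logits_p, dim=-1)
    log_a = F.log_softmax(logits_a, dim=-1)
    kl = F.kl_div(log_a, log_p, reduction="batchmean", log_target=True) \
       + F.kl_div(log_p, log_a, reduction="batchmean", log_target=True)

    return ce + alpha * kl
```

Because only the primary view is decoded at test time, this training scheme leaves inference untouched, which is how the method keeps the original model's decoding speed.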
Related papers
- Rethinking the adaptive relationship between Encoder Layers and Decoder Layers [2.460250239278795]
This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en.
The results suggest that directly modifying the pre-trained model structure for fine-tuning yields suboptimal performance.
arXiv Detail & Related papers (2024-05-14T13:05:16Z)
- Layer-wise Representation Fusion for Compositional Generalization [26.771056871444692]
A key reason for failure on compositional generalization is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled.
We explain why this entanglement arises by analyzing how representations evolve from the bottom to the top of the Transformer layers.
Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process.
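As a rough illustration of what fusing previous layers' information can look like, the sketch below combines earlier layer outputs with learned softmax weights; this is a generic fusion module under assumed shapes, not necessarily LRF's exact formulation.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Learned fusion of previous layers' outputs (generic sketch; the
    actual LRF formulation may differ)."""

    def __init__(self, num_layers):
        super().__init__()
        # One scalar weight per layer, normalized with a softmax.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of k tensors of shape [B, S, H], k <= num_layers.
        k = len(layer_outputs)
        stacked = torch.stack(layer_outputs, dim=0)       # [k, B, S, H]
        w = torch.softmax(self.weights[:k], dim=0)        # [k]
        return (w.view(k, 1, 1, 1) * stacked).sum(dim=0)  # fused [B, S, H]
```

The fused representation can then be fed back into the next encoding or decoding step in place of a single layer's output.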
arXiv Detail & Related papers (2023-07-20T12:01:40Z)
- Learning to Compose Representations of Different Encoder Layers towards Improving Compositional Generalization [29.32436551704417]
We propose CompoSition (Compose Syntactic and Semantic Representations).
CompoSition achieves competitive results on two comprehensive and realistic benchmarks.
arXiv Detail & Related papers (2023-05-20T11:16:59Z)
- Exploring and Exploiting Multi-Granularity Representations for Machine Reading Comprehension [13.191437539419681]
We propose a novel approach called Adaptive Bidirectional Attention-Capsule Network (ABA-Net).
ABA-Net adaptively feeds source representations from different levels to the predictor.
We achieve new state-of-the-art performance on the SQuAD 1.0 dataset.
arXiv Detail & Related papers (2022-08-18T10:14:32Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
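The core input transformation is easy to sketch: the image is cut into fixed-size patches, each patch is linearly projected, and the resulting sequence is processed by a standard transformer encoder. The hyperparameters below are illustrative, not SETR's actual configuration.

```python
import torch
import torch.nn as nn

# SETR-style input handling: flatten an image into a sequence of patch
# embeddings, then feed it to a standard transformer encoder.
patch, dim = 16, 768
proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,  # SETR uses far more layers; kept small here
)

img = torch.randn(1, 3, 224, 224)
tokens = proj(img).flatten(2).transpose(1, 2)  # [1, (224/16)^2 = 196, 768]
memory = encoder(tokens)  # global context is modeled in every layer
# A simple decoder then reshapes/upsamples `memory` back to a 2-D mask.
```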
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior [80.5637175255349]
We propose a new enriched-prior-based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net.
To extract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical-structure-constrained neural network.
Our network can obtain state-of-the-art performance for representation learning and clustering.
arXiv Detail & Related papers (2020-09-08T13:10:21Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture, we introduce a weight function that re-balances classes, increasing the networks' attention to under-represented objects.
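The summary does not give the exact weight function, but a common way to realize class re-balancing is to weight the loss by a smoothed inverse of each class's pixel frequency, as in this hypothetical sketch:

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(pixel_counts):
    """One plausible re-balancing weight function (the paper's exact form is
    not given in the summary): rarer classes get larger weights."""
    freq = pixel_counts / pixel_counts.sum()
    w = 1.0 / torch.log(1.02 + freq)  # smoothed inverse log frequency
    return w / w.mean()               # normalize around 1

counts = torch.tensor([9.0e6, 5.0e5, 2.0e4])  # e.g., road, car, rider pixels
criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(counts))
```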
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- Suppress and Balance: A Simple Gated Network for Salient Object Detection [89.88222217065858]
We propose a simple gated network (GateNet) to solve both issues at once.
With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder.
In addition, we adopt the atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects of various scales.
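A minimal sketch of the gating idea, assuming equal-resolution encoder and decoder features; GateNet's actual multilevel gate units and Fold-ASPP are more elaborate:

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Illustrative gate unit: a learned sigmoid gate decides how much
    encoder context flows into the decoder at one level (a sketch of the
    idea, not GateNet's exact block)."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        g = self.gate(torch.cat([enc_feat, dec_feat], dim=1))  # values in (0, 1)
        return dec_feat + g * enc_feat  # suppress or pass encoder context
```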
arXiv Detail & Related papers (2020-07-16T02:00:53Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding: for each decoder layer, representations from the last encoder layer serve as a global view, and representations from other encoder layers are supplemented for a stereoscopic view of the source sequences.
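One simple way to picture this is a decoder layer whose cross-attention memory concatenates the global view with an auxiliary encoder layer, as in this hypothetical sketch:

```python
import torch
import torch.nn as nn

class MultiViewCrossAttention(nn.Module):
    """Sketch of one decoder layer's multi-view attention: the memory is the
    last encoder layer (global view) concatenated with another encoder layer
    (auxiliary view). Illustrative, not the paper's exact design."""

    def __init__(self, dim, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, tgt, global_view, aux_view):
        memory = torch.cat([global_view, aux_view], dim=1)  # [B, S1+S2, H]
        out, _ = self.attn(query=tgt, key=memory, value=memory)
        return out
```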
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.