Rethinking the adaptive relationship between Encoder Layers and Decoder Layers
- URL: http://arxiv.org/abs/2405.08570v1
- Date: Tue, 14 May 2024 13:05:16 GMT
- Title: Rethinking the adaptive relationship between Encoder Layers and Decoder Layers
- Authors: Yubo Song
- Abstract summary: This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en.
The results suggest that directly modifying the pre-trained model structure for fine-tuning yields suboptimal performance.
- Score: 2.460250239278795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en, which translates German to English. The specific method involves introducing a bias-free fully connected layer between the Encoder and Decoder, with different initializations of the layer's weights, and observing the outcomes of fine-tuning versus retraining. Four experiments were conducted in total. The results suggest that directly modifying the pre-trained model structure for fine-tuning yields suboptimal performance. However, upon observing the outcomes of the experiments with retraining, this structural adjustment shows significant potential.
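As a rough illustration of the setup described above, the following sketch (an assumed implementation, not the author's released code) wraps the encoder of Helsinki-NLP/opus-mt-de-en so that its hidden states pass through an extra bias-free fully connected layer before reaching the decoder; the `init` argument stands in for the different weight initializations compared across the four experiments.

```python
# Hypothetical sketch: a bias-free Linear "bridge" between encoder and decoder.
import torch.nn as nn
from transformers import MarianMTModel, MarianTokenizer

class BridgedEncoder(nn.Module):
    """Wraps the Marian encoder and projects its output with a bias-free Linear layer."""
    def __init__(self, encoder, d_model, init="identity"):
        super().__init__()
        self.encoder = encoder
        self.config = encoder.config
        self.main_input_name = encoder.main_input_name
        self.bridge = nn.Linear(d_model, d_model, bias=False)
        if init == "identity":
            nn.init.eye_(self.bridge.weight)             # start close to the original model (fine-tuning case)
        else:
            nn.init.xavier_uniform_(self.bridge.weight)  # random start (retraining case)

    def forward(self, *args, **kwargs):
        out = self.encoder(*args, **kwargs)
        if isinstance(out, tuple):                       # handle return_dict=False
            return (self.bridge(out[0]),) + out[1:]
        out.last_hidden_state = self.bridge(out.last_hidden_state)
        return out

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model.model.encoder = BridgedEncoder(model.model.encoder, model.config.d_model)

batch = tokenizer(["Guten Morgen!"], return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True))
```

From here, the fine-tuning experiments would update the pre-trained weights together with the bridge, while the retraining experiments would train the modified structure from scratch.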
Related papers
- Layer-wise Representation Fusion for Compositional Generalization [26.771056871444692]
A key reason for failure on compositional generalization is that the syntactic and semantic representations of sequences in the uppermost layers of both the encoder and decoder are entangled.
We explain why it exists by analyzing the representation evolving mechanism from the bottom to the top of the Transformer layers.
Inspired by this, we propose LRF, a novel Layer-wise Representation Fusion framework for CG, which learns to fuse previous layers' information back into the encoding and decoding process.
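For intuition, here is a generic sketch of layer-wise fusion (a simplification, not the exact LRF formulation): learn softmax weights over the hidden states of all previous layers and fuse them into a single representation that can be fed back into the encoding or decoding process.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Fuses hidden states from several layers with learned softmax weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))  # one learnable score per layer

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, d_model) tensors, one per layer
        weights = torch.softmax(self.scores, dim=0)
        stacked = torch.stack(layer_states, dim=0)            # (num_layers, batch, seq_len, d_model)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

fusion = LayerFusion(num_layers=6)
states = [torch.randn(2, 10, 512) for _ in range(6)]          # toy per-layer hidden states
fused = fusion(states)                                        # (2, 10, 512) fused representation
```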
arXiv Detail & Related papers (2023-07-20T12:01:40Z) - Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
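A minimal sketch of fine-tuning with layer freezing in the spirit of Telly (the checkpoint name and the number of frozen layers below are illustrative choices, not the paper's recipe):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")  # any BERT/RoBERTa-style code model
k = 8  # number of bottom encoder layers to keep frozen (a tunable choice)

# Freeze the embeddings and the bottom k layers; only the remaining layers are fine-tuned.
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:k]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```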
arXiv Detail & Related papers (2023-04-11T13:34:13Z) - Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders [89.29256833403169]
We introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods.
KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation.
Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT despite having 3x faster inference.
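A hedged sketch of what the post-training alignment could look like (the loss below is an assumed distillation-style objective, not necessarily KALE's exact one): a smaller query encoder is trained so that its query-passage score distribution matches that of the frozen original encoder under a KL term.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_q, teacher_q, passage_emb, temperature=1.0):
    """KL between student and teacher score distributions over a batch of passages."""
    s_scores = student_q @ passage_emb.T / temperature    # (batch_q, batch_p)
    t_scores = teacher_q @ passage_emb.T / temperature
    return F.kl_div(F.log_softmax(s_scores, dim=-1),
                    F.softmax(t_scores, dim=-1),
                    reduction="batchmean")

# toy tensors standing in for encoder outputs
student_q = torch.randn(4, 768, requires_grad=True)   # compressed query encoder output
teacher_q = torch.randn(4, 768)                       # original (frozen) query encoder output
passages  = torch.randn(16, 768)                      # fixed passage embeddings
loss = alignment_loss(student_q, teacher_q, passages)
loss.backward()
```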
arXiv Detail & Related papers (2023-03-31T15:44:13Z) - WLD-Reg: A Data-dependent Within-layer Diversity Regularizer [98.78384185493624]
Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization.
We propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage the diversity of the activations within the same layer.
We present an extensive empirical study confirming that the proposed approach enhances the performance of several state-of-the-art neural network models in multiple tasks.
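A hedged sketch of a 'within-layer' diversity penalty (the cosine-similarity form below is an illustrative stand-in, not WLD-Reg's data-dependent regularizer):

```python
import torch
import torch.nn.functional as F

def within_layer_diversity_penalty(activations):
    # activations: (batch, num_units) output of one hidden layer
    unit_vectors = F.normalize(activations.T, dim=1)       # (num_units, batch), one vector per unit
    sim = unit_vectors @ unit_vectors.T                     # pairwise cosine similarities between units
    off_diag = sim - torch.diag(torch.diag(sim))            # ignore each unit's similarity to itself
    return off_diag.abs().mean()

h = torch.relu(torch.randn(32, 128))                        # toy hidden activations
penalty = within_layer_diversity_penalty(h)
# total_loss = task_loss + lambda_div * penalty             # added to the usual 'between-layer' objective
```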
arXiv Detail & Related papers (2023-01-03T20:57:22Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We fine-tune a pre-trained encoder-decoder model on the task of document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
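A hedged sketch of the resulting inference pattern, with T5 as a stand-in model: the document's encoder states are precomputed offline and the decoder alone scores the query's likelihood at ranking time (this illustrates the precompute-and-score idea, not the paper's exact decoder-only decomposition).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

doc = "The Transformer architecture relies entirely on attention mechanisms."
query = "what does the transformer rely on"

with torch.no_grad():
    doc_ids = tokenizer(doc, return_tensors="pt").input_ids
    encoder_outputs = model.encoder(doc_ids)          # cache this per document, offline
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    out = model(encoder_outputs=encoder_outputs, labels=query_ids)

score = -out.loss.item()   # higher (less negative) = query more likely given the document
print(score)               # rank candidate documents by this score
```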
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - Latency Adjustable Transformer Encoder for Language Understanding [0.8287206589886879]
This paper proposes an efficient Transformer architecture that adjusts the inference computational cost adaptively with a desired inference latency speedup.
The proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric.
The proposed method is shown mathematically and experimentally to improve the inference latency of BERT_base and GPT-2 by up to 4.8x and 3.72x, respectively, with less than a 0.75% accuracy drop and passable perplexity on average.
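A hedged sketch of per-layer token elimination; the attention-mass importance score below is a common proxy standing in for the ACC metric:

```python
import torch

def prune_tokens(hidden, attn, keep_ratio=0.7):
    # hidden: (batch, seq, d_model)  attn: (batch, heads, seq, seq) attention probabilities
    importance = attn.mean(dim=1).sum(dim=1)          # attention mass received by each token
    k = max(1, int(keep_ratio * hidden.size(1)))
    idx = importance.topk(k, dim=-1).indices.sort(dim=-1).values   # keep tokens in original order
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

hidden = torch.randn(2, 12, 768)                                   # toy encoder-layer output
attn = torch.softmax(torch.randn(2, 12, 12, 12), dim=-1)           # toy attention weights
pruned = prune_tokens(hidden, attn)                                # (2, 8, 768): shorter sequence for the next layer
print(pruned.shape)
```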
arXiv Detail & Related papers (2022-01-10T13:04:39Z) - Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour (a VAE whose encoder does not consistently encode samples drawn from its own decoder) on the learned representations, and also the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
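A hedged sketch of a self-consistency term (toy encoder and decoder, not the paper's exact objective): decode a latent, re-encode the generated sample, and penalize the mismatch between the two latent codes.

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))   # toy encoder (latent mean)
dec = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))   # toy decoder

x = torch.rand(16, 784)
z = enc(x)                                    # deterministic mean used for brevity
x_hat = dec(z)
z_again = enc(x_hat)                          # re-encode the model's own reconstruction
consistency_loss = (z - z_again).pow(2).mean()
# total_loss = elbo_loss + beta * consistency_loss   # beta: weighting hyperparameter (assumed)
```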
arXiv Detail & Related papers (2020-12-07T14:16:14Z) - Layer-Wise Multi-View Learning for Neural Machine Translation [45.679212203943194]
Traditional neural machine translation is limited to the topmost encoder layer's context representation.
We propose layer-wise multi-view learning to solve this problem.
Our approach yields stable improvements over multiple strong baselines.
arXiv Detail & Related papers (2020-11-03T05:06:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.