Rethinking embedding coupling in pre-trained language models
- URL: http://arxiv.org/abs/2010.12821v1
- Date: Sat, 24 Oct 2020 07:43:00 GMT
- Title: Rethinking embedding coupling in pre-trained language models
- Authors: Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, Sebastian Ruder
- Abstract summary: We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
- Score: 46.11201932668366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We re-evaluate the standard practice of sharing weights between input and
output embeddings in state-of-the-art pre-trained language models. We show that
decoupled embeddings provide increased modeling flexibility, allowing us to
significantly improve the efficiency of parameter allocation in the input
embedding of multilingual models. By reallocating the input embedding
parameters in the Transformer layers, we achieve dramatically better
performance on standard natural language understanding tasks with the same
number of parameters during fine-tuning. We also show that allocating
additional capacity to the output embedding provides benefits to the model that
persist through the fine-tuning stage even though the output embedding is
discarded after pre-training. Our analysis shows that larger output embeddings
prevent the model's last layers from overspecializing to the pre-training task
and encourage Transformer representations to be more general and more
transferable to other tasks and languages. Harnessing these findings, we are
able to train models that achieve strong performance on the XTREME benchmark
without increasing the number of parameters at the fine-tuning stage.
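The abstract describes an architectural choice rather than a concrete implementation. As a rough illustration, the PyTorch-style sketch below (hypothetical class names and dimensions, not the authors' code) decouples a small input embedding from a larger output embedding that exists only for the pre-training softmax; the parameters saved on the input side could instead be spent on extra Transformer layers, and the output head would be dropped before fine-tuning.

```python
# Minimal sketch, assuming a toy masked-LM-style encoder; positional encodings
# and masking are omitted for brevity. Names and sizes are illustrative only.
import torch
import torch.nn as nn

class DecoupledEmbeddingEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_input=128, d_model=512,
                 d_output=1024, num_layers=8, num_heads=8):
        super().__init__()
        # Small, decoupled input embedding projected up to the model width;
        # the parameters saved here can be reallocated to Transformer layers.
        self.input_emb = nn.Embedding(vocab_size, d_input)
        self.input_proj = nn.Linear(d_input, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Larger, decoupled output embedding used only by the pre-training
        # softmax; it is discarded before fine-tuning.
        self.output_proj = nn.Linear(d_model, d_output)
        self.output_emb = nn.Linear(d_output, vocab_size, bias=False)

    def encode(self, token_ids):
        # Representations kept for fine-tuning and downstream tasks.
        return self.encoder(self.input_proj(self.input_emb(token_ids)))

    def forward(self, token_ids):
        # Pre-training head: project to the wider output space, then to vocab.
        return self.output_emb(self.output_proj(self.encode(token_ids)))

# Example: only `encode` (input embedding, projection, encoder) would be kept
# at fine-tuning time, with a fresh task head attached on top.
model = DecoupledEmbeddingEncoder()
logits = model(torch.randint(0, 32000, (2, 16)))  # (2, 16, vocab_size)
```

Tying the weights would instead reuse `input_emb.weight` as the softmax matrix, which forces the input and output embeddings to share both their dimensionality and their token representations; decoupling removes that constraint.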
Related papers
- GASE: Generatively Augmented Sentence Encoding [0.0]
We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time.
Generatively Augmented Sentence Encoding uses diverse synthetic variants of input texts generated by paraphrasing, summarising or extracting keywords.
We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance.
arXiv Detail & Related papers (2024-11-07T17:53:47Z)
- ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model [9.1108256816605]
We propose a method to improve model representation and processing efficiency by replacing the tokenizers of large language models (LLMs).
Our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
arXiv Detail & Related papers (2024-10-06T03:01:07Z) - On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based
Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning; a minimal soft-prompt sketch is given after this list.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference [32.62084449979531]
We extend SortedNet to generative NLP tasks by replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT).
Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference.
Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit).
arXiv Detail & Related papers (2023-09-16T11:58:34Z)
- Leveraging Synthetic Targets for Machine Translation [5.302421715411791]
We show that training models on synthetic targets outperforms training on the actual ground-truth data.
We provide a preliminary analysis of whether this boost in performance is linked to ease of optimization or to the more deterministic nature of the predictions.
arXiv Detail & Related papers (2023-05-07T07:42:22Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- Investigating Efficiently Extending Transformers for Long Input Summarization [37.622021824791254]
We investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization.
We find that a staggered, block-local Transformer with global tokens strikes a good balance of performance and efficiency.
We introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens.
arXiv Detail & Related papers (2022-08-08T18:10:58Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the frozen pretrained transformer (FPT) to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
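For the cross-lingual prompt tuning entry above, the sketch below shows one common form of parameter-efficient fine-tuning, soft prompt tuning, assuming a generic frozen backbone that consumes input embeddings directly; `SoftPromptWrapper` and `prompt_len` are illustrative names, not taken from the cited paper.

```python
# Minimal sketch of soft prompt tuning: learnable prompt vectors are prepended
# to the token embeddings while the pretrained backbone stays frozen.
# `backbone` is assumed to map (batch, seq, d_model) embeddings to outputs.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, backbone, d_model, prompt_len=20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                      # train only the prompt
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds):
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))

# Example with a stand-in backbone (identity) just to show the shapes.
wrapper = SoftPromptWrapper(nn.Identity(), d_model=512, prompt_len=20)
out = wrapper(torch.randn(4, 32, 512))                   # (4, 52, 512)
```

Because only `prompt_len * d_model` parameters are updated, the same frozen multilingual backbone can host a separate prompt per task or language.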
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences of its use.