Token-wise Curriculum Learning for Neural Machine Translation
- URL: http://arxiv.org/abs/2103.11088v1
- Date: Sat, 20 Mar 2021 03:57:59 GMT
- Title: Token-wise Curriculum Learning for Neural Machine Translation
- Authors: Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Weizhu Chen,
Jianfeng Gao and Tuo Zhao
- Abstract summary: Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from training data at the early training stage.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
- Score: 94.93133801641707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing curriculum learning approaches to Neural Machine Translation (NMT)
require sampling sufficient amounts of "easy" samples from training data at the
early training stage. This is not always achievable for low-resource languages
where the amount of training data is limited. To address this limitation, we
propose a novel token-wise curriculum learning approach that creates sufficient
amounts of easy samples. Specifically, the model learns to predict a short
sub-sequence from the beginning part of each target sentence at the early stage
of training, and then the sub-sequence is gradually expanded as the training
progresses. Such a new curriculum design is inspired by the cumulative effect
of translation errors, which makes the latter tokens more difficult to predict
than the beginning ones. Extensive experiments show that our approach can
consistently outperform baselines on 5 language pairs, especially for
low-resource languages. Combining our approach with sentence-level methods
further improves the performance on high-resource languages.
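The prefix-expansion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how quickly the sub-sequence grows, so the linear schedule, the `start_fraction` default, and the function names `prefix_fraction`, `prefix_mask`, and `curriculum_loss` below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List


def prefix_fraction(step: int, warmup_steps: int, start_fraction: float = 0.1) -> float:
    """Fraction of each target sentence that contributes to the loss at `step`.

    Assumption: linear growth from `start_fraction` to 1.0 over `warmup_steps`.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    progress = max(step, 0) / warmup_steps
    return start_fraction + (1.0 - start_fraction) * progress


def prefix_mask(target_len: int, fraction: float) -> List[int]:
    """Return 1 for target positions inside the current prefix, 0 otherwise."""
    if target_len <= 0:
        return []
    # Always keep at least one token so every sentence yields a training signal.
    keep = max(1, min(target_len, math.ceil(fraction * target_len)))
    return [1] * keep + [0] * (target_len - keep)


def curriculum_loss(token_losses: List[float], step: int, warmup_steps: int) -> float:
    """Average the per-token losses over the prefix that is active at `step`."""
    mask = prefix_mask(len(token_losses), prefix_fraction(step, warmup_steps))
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept
```

In a real NMT training loop the per-token losses would come from the decoder's cross-entropy, and the mask would be applied per sentence in the batch. Early in training only the first few target tokens are scored, and the scored prefix grows until the full sentence is covered.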
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability [25.52470575274251]
We observe that language models generate short repetitive phrases before learning to generate longer and more coherent text.
Individual tokens often exhibit sudden increases or decreases in loss that are surprisingly consistent across pre-training runs.
More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training.
arXiv Detail & Related papers (2023-08-29T16:24:09Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators [10.557167523009392]
We present Multi-Stage Prompting, a simple and lightweight approach for better adapting pre-trained language models to translation tasks.
To make pre-trained language models better translators, we divide the translation process via pre-trained language models into three separate stages.
During each stage, we independently apply different continuous prompts so that pre-trained language models adapt better to translation tasks.
arXiv Detail & Related papers (2021-10-13T10:06:21Z)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to any other language to obtain an interpretable execution.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)