Cost-Effective Training in Low-Resource Neural Machine Translation
- URL: http://arxiv.org/abs/2201.05700v1
- Date: Fri, 14 Jan 2022 22:57:14 GMT
- Title: Cost-Effective Training in Low-Resource Neural Machine Translation
- Authors: Sai Koneru, Danni Liu, Jan Niehues
- Abstract summary: We propose a cost-effective training procedure to increase the performance of NMT models utilizing a small number of annotated sentences and dictionary entries.
We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions.
- Score: 12.968557512440759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Active Learning (AL) techniques are explored in Neural Machine
Translation (NMT), only a few works focus on tackling low annotation budgets
where only a limited number of sentences can be translated. Such situations are
especially challenging and can occur for endangered languages with few human
annotators or having cost constraints to label large amounts of data. Although
AL is shown to be helpful with large budgets, it is not enough to build
high-quality translation systems in these low-resource conditions. In this
work, we propose a cost-effective training procedure to increase the
performance of NMT models utilizing a small number of annotated sentences and
dictionary entries. Our method leverages monolingual data with self-supervised
objectives and a small-scale, inexpensive dictionary for additional supervision
to initialize the NMT model before applying AL. We show that improving the
model using a combination of these knowledge sources is essential to exploit AL
strategies and increase gains in low-resource conditions. We also present a
novel AL strategy inspired by domain adaptation for NMT and show that it is
effective for low budgets. We propose a new hybrid data-driven approach, which
samples sentences that are diverse from the labelled data and also most similar
to unlabelled data. Finally, we show that initializing the NMT model and
further using our AL strategy can achieve gains of up to 13 BLEU compared to
conventional AL methods.
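To make the hybrid strategy concrete, below is a minimal Python sketch, not the authors' implementation: it assumes precomputed sentence embeddings from some encoder and scores each candidate by its mean similarity to the unlabelled pool minus its maximum similarity to the already-labelled set. The function name, the embedding source, and the trade-off weight `lam` are illustrative assumptions.

```python
import numpy as np

def hybrid_al_select(cand_emb, labeled_emb, pool_emb, budget, lam=1.0):
    """Sketch of a hybrid data-driven AL step: prefer candidates that are
    representative of the unlabelled pool yet diverse from labelled data."""
    def unit(x):  # row-normalize so dot products are cosine similarities
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    cand, lab, pool = unit(cand_emb), unit(labeled_emb), unit(pool_emb)

    # Representativeness: mean similarity of each candidate to the pool.
    repr_score = (cand @ pool.T).mean(axis=1)
    # Redundancy: highest similarity to anything already labelled.
    redundancy = (cand @ lab.T).max(axis=1)

    # High score = close to the unlabelled data, far from the labelled data.
    score = repr_score - lam * redundancy
    return np.argsort(-score)[:budget]
```

In an actual AL loop the selection would be re-run after each annotation round, since newly labelled sentences change the redundancy term.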
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training (a weight-space sketch follows this list).
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages [1.149936119867417]
Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling.
We propose leveraging the potential of LLMs in the active learning loop for data annotation.
Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements.
arXiv Detail & Related papers (2024-04-02T19:34:22Z)
- Augmenting NER Datasets with LLMs: Towards Automated and Refined Annotation [1.6893691730575022]
This research introduces a novel hybrid annotation approach that synergizes human effort with the capabilities of Large Language Models (LLMs).
By employing a label mixing strategy, it addresses the issue of class imbalance encountered in LLM-based annotations.
This study illuminates the potential of leveraging LLMs to improve dataset quality, introduces a novel technique to mitigate class imbalances, and demonstrates the feasibility of achieving high-performance NER in a cost-effective way.
arXiv Detail & Related papers (2024-03-30T12:13:57Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work, with results on different datasets showing up to 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages [3.475371300689165]
This paper presents a simple yet effective method to tackle data scarcity for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner.
Specifically, our approach combines the cross-entropy loss for supervised learning with a KL-divergence term for unsupervised learning over pseudo and augmented target sentences (see the loss sketch after this list).
Experimental results show that our approach significantly improves NMT baselines by 0.46 to 2.03 BLEU points, especially on low-resource datasets.
arXiv Detail & Related papers (2023-04-02T15:24:08Z)
- An Efficient Active Learning Pipeline for Legal Text Classification [2.462514989381979]
We propose a pipeline for effectively using active learning with pre-trained language models in the legal domain.
We use knowledge distillation to guide the model's embeddings to a semantically meaningful space.
Our experiments on Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show that our approach outperforms standard AL strategies.
arXiv Detail & Related papers (2022-11-15T13:07:02Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources [4.119597443825115]
We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low-resource languages.
We show that adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance.
Our work is the first to quantitatively showcase the impact of different modest compute resources in low-resource NMT.
arXiv Detail & Related papers (2021-03-24T15:40:28Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) on English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate a language model (LM) as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior (see the final sketch after this list).
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
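For the model-merging entry above, the simplest instantiation is linear interpolation in weight space. This is a hedged sketch under that assumption; the paper may use a more elaborate merging scheme, and the merge weight `w` is illustrative.

```python
def merge_state_dicts(sd_a, sd_b, w=0.5):
    """Weight-space linear interpolation of two models with identical
    architectures; one common form of model merging, done without training."""
    assert sd_a.keys() == sd_b.keys(), "models must share an architecture"
    return {k: w * sd_a[k] + (1.0 - w) * sd_b[k] for k in sd_a}

# Hypothetical usage with two Llama-2-7B variants (names are illustrative):
# merged = merge_state_dicts(base_ct.state_dict(), task_sft.state_dict(), w=0.6)
# model.load_state_dict(merged)
```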
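For the consistency-regularization entry, here is a minimal PyTorch-style sketch of the combined objective: cross-entropy on labelled pairs plus a KL term that keeps the model's distribution on an augmented target consistent with its distribution on the pseudo target. The weighting `alpha`, the padding id, and the detach choice are assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def semi_supervised_loss(sup_logits, gold_ids, pseudo_logits, aug_logits,
                         alpha=1.0):
    """Cross-entropy on labelled pairs plus a KL consistency term between
    the model's distributions on pseudo and augmented targets (a sketch)."""
    # Supervised term on the parallel data.
    ce = F.cross_entropy(
        sup_logits.view(-1, sup_logits.size(-1)),
        gold_ids.view(-1),
        ignore_index=0,  # assumes 0 is the padding id
    )
    # Consistency term: treat the pseudo-target distribution as a fixed
    # reference and pull the augmented-target distribution toward it.
    ref = F.softmax(pseudo_logits.detach(), dim=-1)
    log_aug = F.log_softmax(aug_logits, dim=-1)
    kl = F.kl_div(log_aug, ref, reduction="batchmean")
    return ce + alpha * kl
```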
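Finally, for the language-model-prior entry, a sketch of the regularized objective: the usual translation cross-entropy plus a term pushing the TM's output distribution to be probable under a frozen LM prior. The weight `beta` and temperature `tau` are illustrative assumptions, not the paper's exact values.

```python
import torch.nn.functional as F

def lm_prior_loss(tm_logits, lm_logits, gold_ids, beta=0.5, tau=2.0):
    """Translation cross-entropy plus a distillation-style regularizer that
    keeps the TM's outputs probable under a frozen LM prior (a sketch)."""
    ce = F.cross_entropy(
        tm_logits.view(-1, tm_logits.size(-1)),
        gold_ids.view(-1),
        ignore_index=0,  # assumes 0 is the padding id
    )
    # KL between the temperature-smoothed LM prior and the TM distribution.
    prior = F.softmax(lm_logits.detach() / tau, dim=-1)
    tm_log = F.log_softmax(tm_logits / tau, dim=-1)
    reg = F.kl_div(tm_log, prior, reduction="batchmean")
    return ce + beta * reg
```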