Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine
Translation: The Case of Fon Language
- URL: http://arxiv.org/abs/2103.08052v2
- Date: Wed, 17 Mar 2021 13:00:28 GMT
- Authors: Bonaventure F. P. Dossou and Chris C. Emezue
- Abstract summary: We introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.
We compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building effective neural machine translation (NMT) models for very
low-resourced and morphologically rich African indigenous languages is an open
challenge. Besides the issue of finding available resources for them, a lot of
work is put into preprocessing and tokenization. Recent studies have shown that
standard tokenization methods do not always adequately deal with the
grammatical, diacritical, and tonal properties of some African languages. That,
coupled with the extremely low availability of training samples, hinders the
production of reliable NMT models. In this paper, using Fon language as a case
study, we revisit standard tokenization methods and introduce
Word-Expressions-Based (WEB) tokenization, a human-involved super-words
tokenization strategy to create a better representative vocabulary for
training. Furthermore, we compare our tokenization strategy to others on the
Fon-French and French-Fon translation tasks.
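As a rough illustration of the super-word idea, the sketch below greedily matches the longest curated phrase first. The helper name, the toy Fon phrase list, and the greedy longest-match strategy are illustrative assumptions only; the paper's actual WEB procedure relies on human-curated word expressions rather than this exact algorithm.

```python
def phrase_tokenize(text, phrase_vocab):
    """Greedy longest-match tokenization over a curated phrase vocabulary.

    Multi-word expressions found in `phrase_vocab` become single tokens
    ("super-words"); everything else falls back to whitespace tokens.
    """
    words = text.split()
    # Longest phrase in the vocabulary, measured in whitespace-separated words.
    max_len = max((len(p.split()) for p in phrase_vocab), default=1)
    tokens, i = [], 0
    while i < len(words):
        for span in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + span])
            if span == 1 or candidate in phrase_vocab:
                # Emit the matched phrase, or the single word as a fallback.
                tokens.append(candidate)
                i += span
                break
    return tokens

# Hypothetical phrase list; real WEB vocabularies are curated with
# native-speaker involvement.
vocab = {"un ɖo ganji"}
print(phrase_tokenize("un ɖo ganji a", vocab))  # ['un ɖo ganji', 'a']
```

Keeping whole expressions as single vocabulary items is what lets the model see tonal and diacritical phrases as units instead of fragmenting them.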
Related papers
- Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP [13.662528492286528]
We present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation.
Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language.
We introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy.
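The embedding-initialization step can be sketched as a weighted average over source-token embeddings. The similarity weights below are invented for illustration; the actual method derives them from cross-lingual token mappings.

```python
import numpy as np

def init_target_embedding(source_emb, weights):
    """Initialize one target-token embedding as a weighted average of
    semantically similar source-token embeddings.

    source_emb: (V_src, d) source embedding matrix.
    weights: dict mapping source-token ids to similarity weights.
    """
    ids = np.array(list(weights.keys()))
    w = np.array(list(weights.values()), dtype=np.float64)
    w = w / w.sum()  # normalize so the result stays on the embedding scale
    return w @ source_emb[ids]

# Toy 4-token source embedding table in 3 dimensions.
src = np.arange(12, dtype=np.float64).reshape(4, 3)
# Hypothetical: the target token resembles source tokens 0 and 2.
vec = init_target_embedding(src, {0: 0.75, 2: 0.25})
print(vec)  # [1.5 2.5 3.5]
```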
arXiv Detail & Related papers (2024-08-08T08:37:28Z)
- Problematic Tokens: Tokenizer Bias in Large Language Models [4.7245503050933335]
This paper traces the roots of disparities to the tokenization process inherent to large language models.
Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process, inadequately represents non-English languages.
We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify associated security and ethical issues.
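The effect can be illustrated with a toy greedy subword tokenizer with byte fallback. This is not GPT-4o's actual tokenizer; the vocabulary and helper below are hypothetical, and serve only to show how an English-centric vocabulary inflates token counts for other scripts.

```python
def toy_tokenize(word, vocab):
    """Greedy longest-match subword tokenization with byte fallback:
    characters absent from the vocabulary explode into one token per
    UTF-8 byte, which is how under-represented scripts get penalized."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Byte fallback for out-of-vocabulary characters.
            tokens.extend(f"<0x{b:02X}>" for b in word[i].encode("utf-8"))
            i += 1
    return tokens

# Hypothetical English-centric vocabulary.
vocab = {"hel", "lo", "hello"}
print(len(toy_tokenize("hello", vocab)))  # 1
print(len(toy_tokenize("ɔ", vocab)))      # several byte tokens
```

A single covered English word costs one token, while one uncovered non-Latin character can cost several, which compounds into longer sequences and higher cost for the affected languages.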
arXiv Detail & Related papers (2024-06-17T05:13:25Z)
- Towards Better Chinese-centric Neural Machine Translation for Low-resource Languages [12.374365655284342]
Building neural machine translation (NMT) systems has become a pressing need, especially in the low-resource setting.
Recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese.
We present the competition-winning system, which leverages data augmentation with monolingual word embeddings, bilingual curriculum learning, and contrastive re-ranking.
arXiv Detail & Related papers (2022-04-09T01:05:37Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Token-wise Curriculum Learning for Neural Machine Translation [94.93133801641707]
Existing curriculum learning approaches to Neural Machine Translation (NMT) require sufficient sampling amounts of "easy" samples from training data at the early training stage.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
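One plausible reading of token-wise "easiness" uses corpus frequency as a difficulty proxy: train the loss only on the most frequent target tokens early on, then phase in rarer ones. This is an assumption for illustration; the paper's actual difficulty criterion and schedule may differ.

```python
from collections import Counter

def curriculum_mask(target_tokens, freq, keep_ratio):
    """Token-wise curriculum: mark the `keep_ratio` most frequent
    (i.e. easiest) target tokens as active for the loss; harder,
    rarer tokens are phased in as keep_ratio grows toward 1.0."""
    k = max(1, int(len(target_tokens) * keep_ratio))
    ranked = sorted(target_tokens, key=lambda t: -freq[t])
    easy = set(ranked[:k])
    return [t in easy for t in target_tokens]

corpus = "the cat sat on the mat the end".split()
freq = Counter(corpus)
sent = ["the", "mat", "zebra"]  # "zebra" is unseen, hence hardest
print(curriculum_mask(sent, freq, keep_ratio=1/3))  # [True, False, False]
```

The mask would multiply the per-token losses, so early updates come only from tokens the model can plausibly learn from limited data.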
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in building NER models for low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that using NS annotators produces results consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
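The core of subword sampling, exposing the model to varied segmentations of the same word across epochs, can be sketched as below. The vocabulary and helper names are hypothetical; production systems typically use a trained subword model with sampling rather than this uniform-random sketch.

```python
import random

def sample_segmentation(word, vocab, rng):
    """Sample one of the possible segmentations of `word` into
    vocabulary pieces, so the model sees varied subword splits of
    the same word (the essence of subword sampling)."""
    if not word:
        return []
    # All vocabulary pieces that are prefixes of the remaining word.
    prefixes = [word[:j] for j in range(1, len(word) + 1)
                if word[:j] in vocab]
    if not prefixes:
        return [word]  # fall back: emit the remainder as one unknown piece
    piece = rng.choice(prefixes)
    return [piece] + sample_segmentation(word[len(piece):], vocab, rng)

# Hypothetical subword vocabulary covering all single characters.
vocab = {"g", "a", "n", "j", "i", "ga", "nji"}
rng = random.Random(0)
print(sample_segmentation("ganji", vocab, rng))
```

Running the sampler repeatedly yields different splits of the same word, which acts as regularization when training data is scarce.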
arXiv Detail & Related papers (2020-04-08T14:19:05Z)
- Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages [24.775371434410328]
We explore techniques that exploit the qualities of morphologically rich languages (MRLs).
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
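A minimal meta-embedding recipe is to normalize each embedding view and concatenate them, so both signals survive with comparable scale. This is one simple variant for illustration; the paper's combination method may differ.

```python
import numpy as np

def concat_meta_embedding(emb_a, emb_b):
    """Simple meta-embedding: L2-normalize each source embedding and
    concatenate, preserving both the pretrained and the
    morphologically-informed signal."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return np.concatenate([a, b])

pre = np.array([3.0, 4.0])    # hypothetical pretrained vector
morph = np.array([0.0, 2.0])  # hypothetical morphology-aware vector
print(concat_meta_embedding(pre, morph))
```

Downstream layers then learn how much weight to give each half, rather than the combination being fixed in advance.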
arXiv Detail & Related papers (2020-03-09T21:30:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.