Local Translation Services for Neglected Languages
- URL: http://arxiv.org/abs/2101.01628v2
- Date: Wed, 13 Jan 2021 20:09:07 GMT
- Title: Local Translation Services for Neglected Languages
- Authors: David Noever, Josh Kalin, Matt Ciolino, Dom Hambrick, and Gerry Dozier
- Abstract summary: This research illustrates translating two historically interesting, but obfuscated languages: 1) hacker-speak ("l33t") and 2) reverse (or "mirror") writing as practiced by Leonardo da Vinci.
The original contribution highlights a fluent translator of hacker-speak in under 50 megabytes.
The long short-term memory recurrent neural network (LSTM-RNN) extends previous work demonstrating an English-to-foreign translation service built from as little as 10,000 bilingual sentence pairs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computationally lightweight but high-quality translators prompt
consideration of new applications that address neglected languages. Locally run
translators for less popular languages may assist data projects that handle
protected or personal data requiring specific compliance checks before posting
to a public translation API; an army of local, small-scale pair translators
could render reasonable, cost-effective solutions in such cases. Like handling
a specialist's dialect, this research illustrates
translating two historically interesting, but obfuscated languages: 1)
hacker-speak ("l33t") and 2) reverse (or "mirror") writing as practiced by
Leonardo da Vinci. The work generalizes a deep learning architecture to
translatable variants of hacker-speak with lite, medium, and hard vocabularies.
The original contribution highlights a fluent translator of hacker-speak in
under 50 megabytes and demonstrates a generator for augmenting future datasets
with more than a million bilingual sentence pairs. The long short-term memory
recurrent neural network (LSTM-RNN) extends previous work demonstrating
an English-to-foreign translation service built from as little as 10,000
bilingual sentence pairs. This work further solves the equivalent translation
problem in twenty-six additional (non-obfuscated) languages and quantitatively
rank-orders the resulting models by proficiency, with Italian the most
successful and Mandarin Chinese the most challenging. For neglected languages,
the method prototypes novel services for smaller niche translations such as
Kabyle (an Algerian dialect with 5-7 million speakers) that most enterprise
translation services do not yet support. One
anticipates the extension of this approach to other important dialects, such as
translating technical (medical or legal) jargon and processing health records.
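
The two building blocks the abstract describes, a bilingual-pair generator and a compact LSTM translator, can be illustrated with a minimal sketch. The substitution table below is a hypothetical stand-in for the paper's lite/medium/hard vocabularies, and the Keras encoder-decoder follows the standard character-level seq2seq recipe that trains from roughly 10,000 sentence pairs; neither is the authors' released code.

```python
import random
from tensorflow import keras
from tensorflow.keras import layers

# --- 1. English -> l33t pair generation (hypothetical "lite" substitution table) ---
LITE_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def to_l33t(sentence, table=LITE_MAP, p=0.8):
    """Obfuscate an English sentence by probabilistic character substitution.
    Sampling with p < 1 lets one English corpus yield many distinct variants,
    which is how a modest seed corpus can be grown past a million pairs."""
    return "".join(
        table[ch.lower()] if ch.lower() in table and random.random() < p else ch
        for ch in sentence
    )

def make_pairs(english_sentences, variants_per_sentence=3):
    """Yield (l33t, English) sentence pairs for supervised seq2seq training."""
    for sent in english_sentences:
        for _ in range(variants_per_sentence):
            yield to_l33t(sent), sent

# --- 2. Character-level LSTM encoder-decoder (assumed sizes) ---
latent_dim = 256           # assumed hidden size
num_source_tokens = 80     # assumed l33t character-vocabulary size
num_target_tokens = 80     # assumed English character-vocabulary size

encoder_inputs = keras.Input(shape=(None, num_source_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

decoder_inputs = keras.Input(shape=(None, num_target_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = layers.Dense(num_target_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```

At these assumed sizes each LSTM layer has a few hundred thousand parameters, so the serialized translator stays comfortably below the 50-megabyte footprint cited above.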
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages (arXiv 2024-08-05)
Neural machine translation systems learn to map sentences from different languages into a common representation space.
In this work, we test this hypothesis by translating zero-shot from unseen languages and demonstrate that this setup enables zero-shot translation from entirely unseen languages.
- Do Multilingual Language Models Think Better in English? (arXiv 2023-08-02)
Translate-test is a popular technique for improving the performance of multilingual language models.
In this work, we introduce a new approach called self-translate, which removes the need for an external translation system by having the model translate its own input; a rough sketch follows.
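
As a rough illustration of the self-translate idea, the same model that answers the task is first prompted (few-shot) to render the input in English; the `generate` callable, prompt template, and example pairs below are assumptions for the sketch, not the paper's exact setup.

```python
def self_translate(generate, source_text, examples):
    """Use the multilingual model itself to translate the input into English
    before solving the downstream task, instead of calling an external MT system.

    `generate` is any text-completion callable wrapping the same model; the
    few-shot prompt format here is an illustrative assumption."""
    shots = "\n".join(f"{src} => {eng}" for src, eng in examples)
    prompt = f"Translate to English.\n{shots}\n{source_text} =>"
    return generate(prompt).strip()

# The downstream task (QA, classification, etc.) is then run on the returned
# English text with the very same model.
```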
- On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss (arXiv 2023-05-26)
Unsupervised neural machine translation (UNMT) has achieved success in many language pairs.
The copying problem, i.e., directly copying parts of the input sentence as the translation, is common among distant language pairs.
We propose a simple but effective training schedule that incorporates a language discriminator loss (sketched below).
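
A rough sketch of how such a discriminator term could be attached to the UNMT objective is below; the pooling, the plain cross-entropy form, and the weighting are assumptions about the general recipe rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    """Predicts which language a pooled decoder representation belongs to."""
    def __init__(self, hidden_dim: int, num_langs: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_langs)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        pooled = decoder_states.mean(dim=1)   # (batch, hidden_dim)
        return self.classifier(pooled)        # (batch, num_langs)

def combined_loss(translation_loss, disc_logits, target_lang_id, weight=0.1):
    """Penalize outputs the discriminator does not recognize as the target
    language, discouraging the decoder from simply copying the source text."""
    disc_loss = F.cross_entropy(disc_logits, target_lang_id)
    return translation_loss + weight * disc_loss
```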
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models (arXiv 2023-05-11)
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge in the form of chained multilingual dictionaries for a subset of input words in order to elicit translation abilities; a sketch of the prompt construction follows.
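
The sketch below shows one plausible way to build such a prompt: chained dictionary hints are prepended for the source words that appear in a multilingual lexicon. The template, lexicon format, and language choices are illustrative assumptions, not the paper's exact prompt.

```python
def chain_of_dictionary_prompt(sentence, chained_lexicon, src="English", tgt="Kabyle"):
    """Prepend chained multilingual dictionary hints for known source words,
    then ask the LLM to translate.

    `chained_lexicon` maps a source word to its translations in several
    auxiliary languages, e.g.
    {"river": {"French": "rivière", "Spanish": "río", "Kabyle": "asif"}}."""
    hints = []
    for word in sentence.split():
        key = word.lower().strip(".,!?")
        if key in chained_lexicon:
            chain = ", ".join(f"'{w}' in {lang}" for lang, w in chained_lexicon[key].items())
            hints.append(f"The word '{key}' means {chain}.")
    hint_block = "\n".join(hints)
    return f"{hint_block}\nTranslate the following {src} sentence into {tgt}: {sentence}"
```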
- Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages (arXiv 2023-05-05)
In humanitarian scenarios, translation into severely low-resource languages often does not require a universal translation engine.
We attempt to leverage translation resources from many rich-resource languages to efficiently produce the best possible translation quality.
We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best.
- Bootstrapping Multilingual Semantic Parsers using Large Language Models (arXiv 2022-10-13)
The translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models.
We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting.
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings (arXiv 2021-03-11)
We exploit a language-independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese with a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high-quality translations (a schematic follows).
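
A schematic of the non-iterative backtranslation step, under assumed names: target-language monolingual text is translated once by the existing model to create synthetic pairs, which are then used for a single round of fine-tuning.

```python
def build_backtranslation_corpus(translate, monolingual_target_sentences, pivot="en"):
    """One-pass (non-iterative) backtranslation.

    `translate(sentence, tgt_lang)` is any callable wrapping the multilingual
    model; the names and single-pass structure are illustrative assumptions.
    Returns synthetic (pivot, target) pairs for one round of fine-tuning."""
    pairs = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = translate(target_sentence, tgt_lang=pivot)
        pairs.append((synthetic_source, target_sentence))  # train pivot -> target
    return pairs
```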
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation (arXiv 2020-10-27)
Cross-lingual machine reading comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages (see the sketch below).
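
A generic sketch of amalgamating several language-branch teachers into one student by matching their averaged soft targets; the averaging, temperature, and KL form are standard distillation assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Distill several language-branch teachers into a single student model.

    Teacher distributions are averaged and matched by the student with a
    temperature-scaled KL divergence (standard soft-target distillation)."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```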
- Beyond English-Centric Multilingual Machine Translation (arXiv 2020-10-21)
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions while performing competitively with the best single systems of WMT.