Inference-only sub-character decomposition improves translation of unseen logographic characters
- URL: http://arxiv.org/abs/2011.06523v1
- Date: Thu, 12 Nov 2020 17:36:22 GMT
- Title: Inference-only sub-character decomposition improves translation of unseen logographic characters
- Authors: Danielle Saunders, Weston Feely, Bill Byrne
- Abstract summary: Neural Machine Translation (NMT) on logographic source languages struggles when translating `unseen' characters.
We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT.
We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally.
- Score: 18.148675498274866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) on logographic source languages struggles
when translating `unseen' characters, which never appear in the training data.
One possible approach to this problem uses sub-character decomposition for
training and test sentences. However, this approach involves complete
retraining, and its effectiveness for unseen character translation to
non-logographic languages has not been fully explored.
We investigate existing ideograph-based sub-character decomposition
approaches for Chinese-to-English and Japanese-to-English NMT, for both
high-resource and low-resource domains. For each language pair and domain we
construct a test set where all source sentences contain at least one unseen
logographic character. We find that complete sub-character decomposition often
harms unseen character translation, and gives inconsistent results generally.
We offer a simple alternative based on decomposition before inference for
unseen characters only. Our approach allows flexible application, achieving
translation adequacy improvements and requiring no additional models or
training.
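The core idea lends itself to a short illustration. The sketch below applies a sub-character decomposition table only to source characters that never appeared in training, at inference time; it is a minimal sketch rather than the authors' exact implementation, and the `DECOMP` entries and vocabulary are hypothetical stand-ins for a table built from, e.g., Ideographic Description Sequences.

```python
# Minimal sketch of inference-only decomposition, assuming a precomputed
# ideograph decomposition table (e.g. built from Ideographic Description
# Sequences). Only characters unseen in training are rewritten; all other
# text passes through unchanged, so no model retraining is required.
DECOMP = {
    "犇": ["牛", "牛", "牛"],  # illustrative entries only
    "鱻": ["魚", "魚", "魚"],
}

def decompose_unseen(sentence: str, seen_vocab: set) -> str:
    """Rewrite characters absent from the training vocabulary as their
    sub-character sequence, keeping the character itself as a fallback
    when no decomposition is available."""
    out = []
    for ch in sentence:
        if ch in seen_vocab or ch not in DECOMP:
            out.append(ch)
        else:
            out.extend(DECOMP[ch])
    return "".join(out)

# Applied to test sentences only, before subword segmentation and decoding:
seen_vocab = {"牛", "很", "多"}
print(decompose_unseen("犇很多", seen_vocab))  # -> 牛牛牛很多
```

Because the rewrite touches only test input, the same trained model can be used with or without decomposition, which is what allows the flexible application the abstract mentions.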
Related papers
- A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations [0.4499833362998489]
This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy.
To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations.
Results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT.
arXiv Detail & Related papers (2024-09-04T13:49:45Z)
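A minimal sketch of this kind of similarity filtering follows, assuming sentence embeddings have already been computed with a cross-lingual encoder such as IndicSBERT; the threshold value is a hypothetical hyperparameter, not one reported by the paper.

```python
import numpy as np

def filter_parallel_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray,
                          threshold: float = 0.5) -> np.ndarray:
    """Return indices of sentence pairs whose cross-lingual embeddings
    (one row per sentence) exceed a cosine-similarity threshold; pairs
    below it are treated as noisy and dropped."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = (src * tgt).sum(axis=1)  # row-wise cosine similarity
    return np.where(sims >= threshold)[0]
```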
- On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss [120.19360680963152]
Unsupervised neural machine translation (UNMT) has achieved success in many language pairs.
The copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs.
We propose a simple but effective training schedule that incorporates a language discriminator loss.
arXiv Detail & Related papers (2023-05-26T18:14:23Z)
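One way such a discriminator loss could be wired into training is sketched below in PyTorch; the mean pooling, classifier shape, and loss weight are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDiscriminator(nn.Module):
    """Predicts the language of a sequence from mean-pooled decoder states."""
    def __init__(self, hidden_dim: int, n_langs: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, n_langs)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, seq_len, hidden_dim)
        return self.classifier(decoder_states.mean(dim=1))

def total_loss(nmt_loss: torch.Tensor, disc_logits: torch.Tensor,
               target_lang: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Auxiliary term that rewards decoder states classified as the target
    language, discouraging verbatim copies of the source sentence."""
    return nmt_loss + weight * F.cross_entropy(disc_logits, target_lang)
```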
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
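The data-corruption side of such entity denoising can be sketched as follows; the span annotations, knowledge base, and names here are hypothetical, and the real method's noising procedure may differ.

```python
import random

def corrupt_entities(sentence: str, entity_spans, kb, rng):
    """Build a noisy input for denoising pre-training: each annotated
    entity is swapped for a random same-type entity from a knowledge
    base, and the model is trained to restore the original sentence.
    entity_spans: (start, end, type) character spans; kb: type -> names."""
    out, prev = [], 0
    for start, end, etype in sorted(entity_spans):
        out.append(sentence[prev:start])
        out.append(rng.choice(kb.get(etype, [sentence[start:end]])))
        prev = end
    out.append(sentence[prev:])
    return "".join(out)

rng = random.Random(0)
kb = {"PER": ["Ada Lovelace", "Alan Turing"]}
noisy = corrupt_entities("Grace Hopper wrote the compiler.", [(0, 12, "PER")],
                         kb, rng)
print(noisy)  # e.g. "Alan Turing wrote the compiler."
# The pair (noisy, original) forms one denoising training example.
```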
- Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets [20.50917929755389]
We find that character-level models cut the number of untranslated words by over 40% when applied to sparse and noisy datasets.
We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality.
Neither word- nor character-level BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to length.
arXiv Detail & Related papers (2021-09-27T07:35:47Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
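A minimal sketch of budgeted selection over a mixed pool of sentences and phrases follows; the uncertainty scores and cost model (here, rough annotation effort) are assumptions, since the paper's exact acquisition function is not given in the summary above.

```python
def select_for_annotation(candidates, budget: float):
    """Greedily pick translation units (full sentences or phrases) by
    model uncertainty per unit of annotation cost until the budget for
    human translators is spent. candidates: (text, uncertainty, cost)."""
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0.0
    for text, uncertainty, cost in ranked:
        if spent + cost <= budget:
            chosen.append(text)
            spent += cost
    return chosen

pool = [("ein ganzer Satz aus der neuen Domäne", 0.9, 6.0),
        ("kurze Phrase", 0.7, 2.0)]
print(select_for_annotation(pool, budget=7.0))  # -> ['kurze Phrase']
```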
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.