Morphological Word Segmentation on Agglutinative Languages for Neural
Machine Translation
- URL: http://arxiv.org/abs/2001.01589v1
- Date: Thu, 2 Jan 2020 10:05:02 GMT
- Title: Morphological Word Segmentation on Agglutinative Languages for Neural
Machine Translation
- Authors: Yirong Pan, Xiao Li, Yating Yang and Rui Dong
- Abstract summary: We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphological knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
- Score: 8.87546236839959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, a limited-size vocabulary containing only the top-N highest-frequency words is employed for model training, which leaves many rare and unknown words. Translation is especially difficult from low-resource, morphologically rich agglutinative languages, whose complex morphology yields very large vocabularies. In this paper, we propose a morphological word segmentation method on the source side for NMT that incorporates morphological knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time. It can also be utilized as a preprocessing tool to segment the words of agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model, achieving significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks by reducing data sparseness and language complexity.
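The abstract does not spell out the segmentation procedure, but the core idea can be sketched: each source word is split into a stem plus morpheme units before training, so inflected variants share subunits. A minimal sketch follows; the tiny suffix inventory and greedy stripper are hypothetical stand-ins for a real morphological analyzer, not the authors' method.

```python
# Minimal sketch of source-side morphological segmentation for NMT.
# The suffix inventory below is a hypothetical stand-in for a real
# morphological analyzer for Turkish or Uyghur.
SUFFIXES = ["ler", "lar", "den", "dan", "nin", "in"]

def segment(word, suffixes=SUFFIXES, marker="@@"):
    """Greedily strip known suffixes from the right, then emit
    stem + suffixes with BPE-style continuation markers."""
    parts = []
    while True:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 1:
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                break
        else:  # no suffix matched: what remains is the stem
            break
    parts.insert(0, word)
    return [p + marker for p in parts[:-1]] + [parts[-1]]

# Turkish "evlerden" ("from the houses") -> ['ev@@', 'ler@@', 'den']
print(segment("evlerden"))
```

Because the inflected variants of a stem now share subunits, the effective vocabulary shrinks while the morpheme boundaries remain visible to the model.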
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even incur heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
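The trimming step analyzed above is easy to illustrate. A minimal sketch, assuming a simple frequency threshold and greedy longest-match re-segmentation (the paper's exact setup may differ):

```python
from collections import Counter

def trim_vocab(token_counts, min_freq):
    """Keep only subwords at or above min_freq; trimmed subwords will be
    re-expressed through smaller pieces that survived trimming."""
    return {tok for tok, c in token_counts.items() if c >= min_freq}

def resegment(token, vocab):
    """Greedy longest-match segmentation of a trimmed subword into kept
    vocabulary pieces, falling back to single characters."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):  # longest match first
            if token[i:j] in vocab or j == i + 1:
                pieces.append(token[i:j])
                i = j
                break
    return pieces

counts = Counter({"low": 50, "est": 30, "e": 20, "s": 20, "t": 20, "lowest": 2})
vocab = trim_vocab(counts, min_freq=5)
print(resegment("lowest", vocab))  # -> ['low', 'est']
```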
- Code-Switching with Word Senses for Pretraining in Neural Machine Translation [107.23743153715799]
We introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT).
WSP-NMT is an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases.
Our experiments show significant improvements in overall translation quality.
arXiv Detail & Related papers (2023-10-21T16:13:01Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties for machine translation.
A large number of differently inflected surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Morphology Without Borders: Clause-Level Morphological Annotation [8.559428282730021]
We propose to view morphology as a clause-level phenomenon, rather than a word-level one.
We deliver a novel dataset for clause-level morphology covering four typologically different languages: English, German, Turkish and Hebrew.
Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages.
arXiv Detail & Related papers (2022-02-25T17:20:28Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
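As a rough illustration of the denoising objective above (not the authors' exact procedure), entity mentions in monolingual text can be corrupted with same-type entities drawn from a knowledge base, and a sequence-to-sequence model trained to restore the original; the miniature KB below is hypothetical.

```python
import random

# Hypothetical miniature knowledge base of named entities by type.
KB = {"PERSON": ["Marie Curie", "Alan Turing"], "CITY": ["Paris", "Vienna"]}

def make_denoising_pair(tokens, entity_spans, noise_prob=0.5, rng=random):
    """Build one (noisy, clean) training pair: each entity span is, with
    probability noise_prob, swapped for a random same-type KB entity.
    A seq2seq model is then trained to map noisy -> clean."""
    noisy = list(tokens)
    # Process spans right-to-left so earlier indices stay valid.
    for start, end, etype in sorted(entity_spans, reverse=True):
        if rng.random() < noise_prob:
            noisy[start:end] = rng.choice(KB[etype]).split()
    return noisy, tokens

sent = "Alan Turing was born in London".split()
spans = [(0, 2, "PERSON"), (5, 6, "CITY")]
print(make_denoising_pair(sent, spans))
```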
- Evaluation of Morphological Embeddings for the Russian Language [0.0]
Morphology-based embeddings trained with the Skipgram objective do not outperform an existing embedding model, FastText.
A more complex but morphology-unaware model, BERT, achieves significantly greater performance on tasks that presumably require understanding of a word's morphology.
arXiv Detail & Related papers (2021-03-11T11:59:11Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
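The compositional idea above admits a small sketch: rather than storing one output vector per word, each word's output embedding is composed on the fly from a fixed-size table of hashed character n-gram embeddings, so the parameter count does not grow with the vocabulary. This is an illustrative simplification, not the paper's exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BUCKETS, DIM = 4096, 64
# Fixed-size table: parameter count is independent of the vocabulary.
ngram_table = rng.normal(scale=0.1, size=(NUM_BUCKETS, DIM))

def char_ngrams(word, n_min=3, n_max=5):
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def output_embedding(word):
    """Compose the word's output vector from hashed character n-grams.
    (hash() is stable within a process, which suffices for a sketch.)"""
    idx = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return ngram_table[idx].mean(axis=0)

# Logits for any word, seen or unseen, from a hidden state h:
h = rng.normal(size=DIM)
for w in ["translation", "translations"]:  # variants share n-grams
    print(w, float(output_embedding(w) @ h))
```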
- Finding the Optimal Vocabulary Size for Neural Machine Translation [25.38870582223696]
We cast neural machine translation (NMT) as a classification task in an autoregressive setting.
We analyze the limitations of both classification and autoregression components.
We reveal an explanation for why certain vocabulary sizes are better than others.
arXiv Detail & Related papers (2020-04-05T22:17:34Z)
- Urdu-English Machine Transliteration using Neural Networks [0.0]
We present a transliteration technique based on Expectation Maximization (EM) that is unsupervised and language independent.
The system learns the patterns and out-of-vocabulary words from a parallel corpus; there is no need to train it explicitly on a transliteration corpus.
arXiv Detail & Related papers (2020-01-12T17:30:42Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
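A minimal sketch of the joint character-level setup in the last entry above, simplified to per-character tag prediction with toy hyperparameters; the shared character vocabulary is what enables transfer across related languages.

```python
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    """Character-level BiGRU tagger shared across languages: a single
    character vocabulary lets related languages reuse representations."""
    def __init__(self, n_chars, n_tags, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids):            # (batch, seq_len)
        h, _ = self.gru(self.embed(char_ids))
        return self.out(h)                  # per-character tag logits

# Toy joint training step over a mixed high/low-resource batch.
model = CharTagger(n_chars=100, n_tags=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chars = torch.randint(1, 100, (8, 30))      # mixed-language batch
tags = torch.randint(0, 20, (8, 30))
loss = nn.functional.cross_entropy(model(chars).flatten(0, 1), tags.flatten())
loss.backward()
opt.step()
```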
This list is automatically generated from the titles and abstracts of the papers on this site.