BPE vs. Morphological Segmentation: A Case Study on Machine Translation
of Four Polysynthetic Languages
- URL: http://arxiv.org/abs/2203.08954v1
- Date: Wed, 16 Mar 2022 21:27:20 GMT
- Title: BPE vs. Morphological Segmentation: A Case Study on Machine Translation
of Four Polysynthetic Languages
- Authors: Manuel Mager and Arturo Oncevay and Elisabeth Mager and Katharina Kann
and Ngoc Thang Vu
- Abstract summary: We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently.
- Score: 38.5427201289742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Morphologically-rich polysynthetic languages present a challenge for NLP
systems due to data sparsity, and a common strategy to handle this issue is to
apply subword segmentation. We investigate a wide variety of supervised and
unsupervised morphological segmentation methods for four polysynthetic
languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare
the morphologically inspired segmentation methods against Byte-Pair Encodings
(BPEs) as inputs for machine translation (MT) when translating to and from
Spanish. We show that for all language pairs except for Nahuatl, an
unsupervised morphological segmentation algorithm outperforms BPEs consistently
and that, although supervised methods achieve better segmentation scores, they
under-perform in MT challenges. Finally, we contribute two new morphological
segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus
for Raramuri--Spanish.
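The abstract contrasts BPE with morphologically informed segmenters. As a rough illustration of the BPE side of that comparison, here is a minimal toy re-implementation of merge learning and segmentation (a sketch for intuition only, not the code or settings used in the paper):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a whitespace-tokenized corpus.
    Words start as character sequences with an end-of-word marker."""
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the best pair merged everywhere.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

Note that the merges are chosen purely by frequency, so the resulting subwords need not align with morpheme boundaries, which is exactly the contrast the paper investigates against morphological segmenters.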
Related papers
- A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
Parsing performance on Morphologically Rich Languages (MRLs) is lower than on other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
arXiv Detail & Related papers (2024-02-04T16:56:08Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, mimicking the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text [29.95141309131595]
We study the effectiveness of different segmentation approaches on machine translation (MT) performance.
We experiment on MT from code-switched Arabic-English to English.
We find that the choice of the segmentation setup to use for MT is highly dependent on the data size.
arXiv Detail & Related papers (2022-10-11T23:20:12Z)
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Work in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish).
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation [63.88541605363555]
"Extract and Generate" (EAG) is a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data.
We first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences.
We then generate the final aligned examples from the candidates with a well-trained generation model.
arXiv Detail & Related papers (2022-03-04T08:21:27Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? [26.71325671956197]
We design a test suite to evaluate segmentation strategies on different types of morphological phenomena.
We find that learning to analyse and generate morphologically complex surface representations is still challenging.
arXiv Detail & Related papers (2021-09-02T17:23:21Z)
- Canonical and Surface Morphological Segmentation for Nguni Languages [6.805575417034369]
This paper investigates supervised and unsupervised models for morphological segmentation.
We train sequence-to-sequence models for canonical segmentation and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages.
We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
arXiv Detail & Related papers (2021-04-01T21:06:51Z)
- The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation [0.6091702876917281]
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
arXiv Detail & Related papers (2021-03-20T14:39:25Z)
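Several of the papers above score segmenters with F1 over predicted morph boundaries (e.g. the 72.5% average F1 reported for the Nguni languages). A minimal per-word version of such a boundary metric might look like the sketch below; the morph splits in the usage comments are illustrative inventions, not gold annotations from any of these datasets:

```python
def boundaries(morphs):
    """Internal boundary positions implied by a segmentation,
    e.g. ["ne", "mitz", "itta"] -> {2, 6}."""
    cuts, pos = set(), 0
    for m in morphs[:-1]:  # the end of the last morph is not a boundary
        pos += len(m)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, pred):
    """F1 over internal boundaries of a single word's segmentation."""
    g, p = boundaries(gold), boundaries(pred)
    if not g and not p:
        return 1.0  # both leave the word unsegmented: perfect agreement
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A corpus-level score would average this over words (or pool the boundary counts); either way, the metric rewards agreement with gold morpheme boundaries, which is why a segmenter can score well here and still underperform in MT, as the main paper observes for supervised methods.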
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.