How Suitable Are Subword Segmentation Strategies for Translating
Non-Concatenative Morphology?
- URL: http://arxiv.org/abs/2109.01100v1
- Date: Thu, 2 Sep 2021 17:23:21 GMT
- Title: How Suitable Are Subword Segmentation Strategies for Translating
Non-Concatenative Morphology?
- Authors: Chantal Amrhein and Rico Sennrich
- Abstract summary: We design a test suite to evaluate segmentation strategies on different types of morphological phenomena.
We find that learning to analyse and generate morphologically complex surface representations is still challenging.
- Score: 26.71325671956197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven subword segmentation has become the default strategy for
open-vocabulary machine translation and other NLP tasks, but may not be
sufficiently generic for optimal learning of non-concatenative morphology. We
design a test suite to evaluate segmentation strategies on different types of
morphological phenomena in a controlled, semi-synthetic setting. In our
experiments, we compare how well machine translation models trained on subword-
and character-level representations can translate these morphological phenomena. We find that
learning to analyse and generate morphologically complex surface
representations is still challenging, especially for non-concatenative
morphological phenomena like reduplication or vowel harmony and for rare word
stems. Based on our results, we recommend that novel text representation
strategies be tested on a range of typologically diverse languages to minimise
the risk of adopting a strategy that inadvertently disadvantages certain
languages.
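The contrast between subword- and character-level input can be made concrete with a toy example. The sketch below is a minimal, self-contained illustration, not the paper's actual test suite or models: the greedy longest-match segmenter, the tiny vocabulary, and the Indonesian reduplication example buku -> buku-buku ("book" -> "books") are all assumptions chosen for demonstration. It shows how a fixed subword vocabulary splits a reduplicated plural into pieces that obscure the copy relation, while a character-level view keeps every symbol available to the model.

```python
# Toy illustration (not the paper's setup): subword vs. character segmentation
# of a reduplicated plural. Vocabulary and example are illustrative assumptions.

def greedy_subword_segment(word, vocab):
    """Greedy longest-match segmentation, a rough stand-in for trained
    subword inference (real BPE/unigram models are learned from data)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # try the longest piece first
            if word[i:j] in vocab or j == i + 1:   # fall back to a single character
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def char_segment(word):
    """Character-level segmentation: every symbol is its own token."""
    return list(word)

# A subword vocabulary that happens to contain the singular stem but not the
# reduplicated plural (the typical situation for rare or productive forms).
vocab = {"buku", "-", "bu", "ku"}

for form in ["buku", "buku-buku"]:
    print(form, "->", greedy_subword_segment(form, vocab), "|", char_segment(form))
```

Under this toy vocabulary the plural is segmented as ['buku', '-', 'buku'], so the model must learn that the two pieces are copies of each other, whereas the character-level view exposes the repetition symbol by symbol.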
Related papers
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish, ...); a minimal sketch of such a word-level synthesis measure is given after this list.
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages [38.5427201289742]
We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently.
arXiv Detail & Related papers (2022-03-16T21:27:20Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Evaluation of Morphological Embeddings for the Russian Language [0.0]
Morphology-based embeddings trained with a Skipgram objective do not outperform an existing embedding model, FastText.
A more complex but morphology-unaware model, BERT, achieves significantly greater performance on tasks that presumably require an understanding of a word's morphology.
arXiv Detail & Related papers (2021-03-11T11:59:11Z)
- Morphology Matters: A Multilingual Language Modeling Analysis [8.791030561752384]
Prior studies disagree on whether inflectional morphology makes languages harder to model.
We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.
Several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data.
arXiv Detail & Related papers (2020-12-11T11:55:55Z)
- Neural disambiguation of lemma and part of speech in morphologically rich languages [0.6346772579930928]
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages.
We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser.
arXiv Detail & Related papers (2020-07-12T21:48:52Z)
- A Comparative Study of Lexical Substitution Approaches based on Neural Language Models [117.96628873753123]
We present a large-scale comparative study of popular neural language and masked language models.
We show that the already competitive results achieved by state-of-the-art LMs/MLMs can be further improved if information about the target word is injected properly.
arXiv Detail & Related papers (2020-05-29T18:43:22Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
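As mentioned in the Quantifying Synthesis and Fusion entry above, a word-level degree of synthesis can be illustrated with a minimal sketch. It assumes morphologically segmented input in which "|" marks morpheme boundaries; the boundary marker and the Turkish-style examples are assumptions made for this illustration, not that paper's exact format or method.

```python
# Minimal sketch of a word-level synthesis measure: morphemes per word.
# Assumes segmented input where "|" marks morpheme boundaries; the marker and
# the Turkish-style examples below are illustrative assumptions.

def word_synthesis(segmented_word: str, sep: str = "|") -> int:
    """Number of morphemes in a single segmented word."""
    return len(segmented_word.split(sep))

def corpus_synthesis(segmented_words: list[str], sep: str = "|") -> float:
    """Average number of morphemes per word over a segmented corpus."""
    return sum(word_synthesis(w, sep) for w in segmented_words) / len(segmented_words)

words = ["ev|ler|im|de",   # "in my houses": 4 morphemes
         "kitap",          # "book": 1 morpheme
         "git|ti"]         # "(he/she) went": 2 morphemes
print(corpus_synthesis(words))  # (4 + 1 + 2) / 3 ≈ 2.33
```

Higher averages indicate more synthetic (morphologically richer) text, which is the kind of quantity such studies relate to machine translation quality.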
This list is automatically generated from the titles and abstracts of the papers in this site.