The Effectiveness of Morphology-aware Segmentation in Low-Resource
Neural Machine Translation
- URL: http://arxiv.org/abs/2103.11189v1
- Date: Sat, 20 Mar 2021 14:39:25 GMT
- Title: The Effectiveness of Morphology-aware Segmentation in Low-Resource
Neural Machine Translation
- Authors: Jonne Sälevä and Constantine Lignos
- Abstract summary: This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper evaluates the performance of several modern subword segmentation
methods in a low-resource neural machine translation setting. We compare
segmentations produced by applying BPE at the token or sentence level with
morphologically-based segmentations from LMVR and MORSEL. We evaluate
translation tasks between English and each of Nepali, Sinhala, and Kazakh, and
predict that using morphologically-based segmentation methods would lead to
better performance in this setting. However, in comparison with BPE, we find that no
consistent and reliable differences emerge between the segmentation methods.
While morphologically-based methods outperform BPE in a few cases, what
performs best tends to vary across tasks, and the performance of segmentation
methods is often statistically indistinguishable.
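As a rough illustration of the token-level BPE baseline discussed in the abstract, the sketch below implements the standard BPE merge-learning loop (iteratively merging the most frequent adjacent symbol pair). The toy word-frequency corpus and the `</w>` end-of-word marker are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary.

    Each word is split into characters plus an end-of-word marker '</w>';
    at every step the most frequent adjacent symbol pair is merged.
    """
    vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq
                 for symbols, freq in vocab.items()}
    return merges

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def segment(word, merges):
    """Apply learned merges, in order, to segment an unseen word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy word-frequency corpus (illustrative only).
corpus = {"low": 7, "lower": 5, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # e.g. ['low', 'est</w>']
```

Morphological segmenters such as LMVR and MORSEL instead aim to place boundaries at morpheme edges (e.g. low + est), whereas BPE's purely frequency-driven merges only approximate that behavior when morphemes happen to be frequent.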
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, an adaptive gradient-based subword tokenization method that reduces over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z) - Effects of sub-word segmentation on performance of transformer language
models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv Detail & Related papers (2023-05-09T14:30:29Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
This paper exploits contextual hints by learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - Exploring Segmentation Approaches for Neural Machine Translation of
Code-Switched Egyptian Arabic-English Text [29.95141309131595]
We study the effectiveness of different segmentation approaches on machine translation (MT) performance.
We experiment on MT from code-switched Arabic-English to English.
We find that the choice of the segmentation setup to use for MT is highly dependent on the data size.
arXiv Detail & Related papers (2022-10-11T23:20:12Z) - Single Model Ensemble for Subword Regularized Models in Low-Resource
Machine Translation [25.04086897886412]
Subword regularizations use multiple subword segmentations during training to improve the robustness of neural machine translation models.
We propose an inference strategy to address this discrepancy.
Experimental results show that the proposed strategy improves the performance of models trained with subword regularization.
arXiv Detail & Related papers (2022-03-25T09:25:47Z) - BPE vs. Morphological Segmentation: A Case Study on Machine Translation
of Four Polysynthetic Languages [38.5427201289742]
We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encoding (BPE) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPE consistently.
arXiv Detail & Related papers (2022-03-16T21:27:20Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - Dynamic Programming Encoding for Subword Segmentation in Neural Machine
Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.