Exploring Segmentation Approaches for Neural Machine Translation of
Code-Switched Egyptian Arabic-English Text
- URL: http://arxiv.org/abs/2210.06990v3
- Date: Sun, 30 Apr 2023 21:07:46 GMT
- Title: Exploring Segmentation Approaches for Neural Machine Translation of
Code-Switched Egyptian Arabic-English Text
- Authors: Marwa Gaser, Manuel Mager, Injy Hamed, Nizar Habash, Slim Abdennadher
and Ngoc Thang Vu
- Abstract summary: We study the effectiveness of different segmentation approaches on machine translation (MT) performance.
We experiment on MT from code-switched Arabic-English to English.
We find that the choice of the segmentation setup to use for MT is highly dependent on the data size.
- Score: 29.95141309131595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data sparsity is one of the main challenges posed by code-switching (CS),
which is further exacerbated in the case of morphologically rich languages. For
the task of machine translation (MT), morphological segmentation has proven
successful in alleviating data sparsity in monolingual contexts; however, it
has not been investigated for CS settings. In this paper, we study the
effectiveness of different segmentation approaches on MT performance, covering
morphology-based and frequency-based segmentation techniques. We experiment on
MT from code-switched Arabic-English to English. We provide detailed analysis,
examining a variety of conditions, such as data size and sentences with
different degrees of CS. Empirical results show that morphology-aware
segmenters perform the best in segmentation tasks but under-perform in MT.
Nevertheless, we find that the choice of the segmentation setup to use for MT
is highly dependent on the data size. For extreme low-resource scenarios, a
combination of frequency and morphology-based segmentations is shown to perform
the best. For more resourced settings, such a combination does not bring
significant improvements over the use of frequency-based segmentation.
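
As an illustration of what "frequency-based segmentation" can mean in practice, the following is a minimal sketch of byte-pair encoding (BPE), one common frequency-based technique. The abstract does not say which frequency-based segmenter or settings the authors use, so treating BPE as representative is an assumption. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy English word list are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of word tokens (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy usage: learn two merges from a tiny corpus, then segment an unseen word.
merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(segment("lowly", merges))
```

A morphology-based segmenter would instead split words at linguistically motivated boundaries (for example, separating clitics from a stem), independently of how often character sequences co-occur in the training data.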
Related papers
- TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on the multilingual models rather than the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target
Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- BPE vs. Morphological Segmentation: A Case Study on Machine Translation
of Four Polysynthetic Languages [38.5427201289742]
We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently.
arXiv Detail & Related papers (2022-03-16T21:27:20Z)
- Real-Time Scene Text Detection with Differentiable Binarization and
Adaptive Scale Fusion [62.269219152425556]
Segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with
Self-Supervised Depth Estimation [94.16816278191477]
We present a framework for semi-supervised and domain-adaptive semantic segmentation.
It is enhanced by self-supervised monocular depth estimation trained only on unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset.
arXiv Detail & Related papers (2021-08-28T01:33:38Z)
- Canonical and Surface Morphological Segmentation for Nguni Languages [6.805575417034369]
This paper investigates supervised and unsupervised models for morphological segmentation.
We train sequence-to-sequence models for canonical segmentation and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages.
We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
arXiv Detail & Related papers (2021-04-01T21:06:51Z)
- The Effectiveness of Morphology-aware Segmentation in Low-Resource
Neural Machine Translation [0.6091702876917281]
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
arXiv Detail & Related papers (2021-03-20T14:39:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.