TAMS: Translation-Assisted Morphological Segmentation
- URL: http://arxiv.org/abs/2403.14840v2
- Date: Tue, 15 Oct 2024 16:34:51 GMT
- Title: TAMS: Translation-Assisted Morphological Segmentation
- Authors: Enora Rice, Ali Marashian, Luke Gessler, Alexis Palmer, Katharina von der Wense
- Abstract summary: We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
- Score: 3.666125285899499
- License:
- Abstract: Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to dramatically speed up this process. But in typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage this data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual language models as an additional signal. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
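To make the task in the abstract concrete, here is a minimal sketch of how one training example for character-level canonical segmentation could be represented. It is not the authors' code: the class name, the field names, the hyphen boundary symbol, and the English word "funniest" are assumptions chosen for illustration, and the pretrained-language-model encoding of the translation is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SegmentationExample:
    """One hypothetical training example (field names are illustrative)."""
    word: str         # surface form as it appears in text
    canonical: str    # canonical morphemes joined by a boundary symbol
    translation: str  # translation text used as an auxiliary signal


def to_char_sequences(ex: SegmentationExample) -> Tuple[List[str], List[str]]:
    """Split the word and its segmentation into character sequences,
    the input and output of a character-level sequence-to-sequence model."""
    return list(ex.word), list(ex.canonical)


def is_pure_boundary_insertion(ex: SegmentationExample, boundary: str = "-") -> bool:
    """True if removing boundaries from the segmentation gives back the word.
    This holds for surface segmentation but often not for canonical
    segmentation, where morphemes are restored to their underlying forms."""
    return ex.canonical.replace(boundary, "") == ex.word


# Illustrative English example, not taken from the paper's data.
canonical_ex = SegmentationExample("funniest", "funny-est", "most funny")
surface_ex = SegmentationExample("funniest", "funni-est", "most funny")

source_chars, target_chars = to_char_sequences(canonical_ex)
print(source_chars)
print(target_chars)
print(is_pure_boundary_insertion(canonical_ex))  # False: "funny" is restored
print(is_pure_boundary_insertion(surface_ex))    # True: only a boundary is added
```

The check shows why the task is framed as sequence generation rather than boundary tagging: the canonical output can contain characters that are absent from the input word.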
Related papers
- Using Machine Translation to Augment Multilingual Classification [0.0]
We explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages.
We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
arXiv Detail & Related papers (2024-05-09T00:31:59Z)
- Low-resource neural machine translation with morphological modeling [3.3721926640077804]
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation.
We propose a framework-solution for modeling complex morphology in low-resource settings.
We evaluate our proposed solution on Kinyarwanda - English translation using public-domain parallel text.
arXiv Detail & Related papers (2024-04-03T01:31:41Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks.
Our UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- WangchanBERTa: Pretraining transformer-based Thai Language Models [2.186960190193067]
We pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size).
We apply text processing rules that are specific to Thai, most importantly preserving spaces.
We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance.
arXiv Detail & Related papers (2021-01-24T03:06:34Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)