Tackling the Low-resource Challenge for Canonical Segmentation
- URL: http://arxiv.org/abs/2010.02804v1
- Date: Tue, 6 Oct 2020 15:15:05 GMT
- Title: Tackling the Low-resource Challenge for Canonical Segmentation
- Authors: Manuel Mager, Özlem Çetinoğlu and Katharina Kann
- Abstract summary: Canonical morphological segmentation consists of dividing words into their standardized morphemes.
We explore two new models for the task, borrowing from the closely related area of morphological generation.
We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy.
- Score: 23.17111619633273
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Canonical morphological segmentation consists of dividing words into their
standardized morphemes. Here, we are interested in approaches for the task when
training data is limited. We compare model performance in a simulated
low-resource setting for the high-resource languages German, English, and
Indonesian to experiments on new datasets for the truly low-resource languages
Popoluca and Tepehua. We explore two new models for the task, borrowing from
the closely related area of morphological generation: an LSTM pointer-generator
and a sequence-to-sequence model with hard monotonic attention trained with
imitation learning. We find that, in the low-resource setting, the novel
approaches outperform existing ones on all languages by up to 11.4% accuracy.
However, while accuracy in emulated low-resource scenarios is over 50% for all
languages, for the truly low-resource languages Popoluca and Tepehua, our best
model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude
that canonical segmentation is still a challenging task for low-resource
languages.
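To make the task and the metric concrete, the sketch below (plain Python, with made-up example pairs rather than data from the paper's German, English, Indonesian, Popoluca, or Tepehua sets) illustrates the input/output format of canonical segmentation and the word-level exact-match accuracy that the numbers above refer to; the predictions dictionary stands in for the output of any segmenter, such as the LSTM pointer-generator or the hard-monotonic-attention model.
```python
# Illustrative only: toy gold canonical segmentations. Canonical segmentation
# restores the standardized form of each morpheme, so surface spelling changes
# are undone ("funniest" -> "funny est", not "funni est").
gold = {
    "funniest": "funny est",
    "achievability": "achieve able ity",
    "unhappiness": "un happy ness",
}

def word_accuracy(predictions: dict, gold: dict) -> float:
    """Exact-match accuracy: a word counts as correct only if its entire
    canonical segmentation is predicted correctly."""
    correct = sum(predictions.get(word) == seg for word, seg in gold.items())
    return correct / len(gold)

# Hypothetical model output standing in for a trained segmenter.
predictions = {
    "funniest": "funny est",
    "achievability": "achieve able ity",
    "unhappiness": "un happiness",  # missed a boundary -> the whole word counts as wrong
}

print(f"word-level accuracy: {word_accuracy(predictions, gold):.1%}")  # 66.7%
```
Because a single missed boundary or unrestored morpheme makes the whole word wrong, exact-match accuracy is a strict metric, which is part of why the truly low-resource numbers reported above are low.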
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- Low Resource Summarization using Pre-trained Language Models [1.26404863283601]
We propose a methodology for adapting self-attentive transformer-based architectures (mBERT, mT5) for low-resource summarization.
Our adapted summarization model urT5 can capture contextual information of a low-resource language effectively, with evaluation scores (up to 46.35 ROUGE-1, 77 BERTScore) on par with state-of-the-art models for the high-resource language English.
arXiv Detail & Related papers (2023-10-04T13:09:39Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- The Importance of Context in Very Low Resource Language Modeling [3.734153902687548]
In very low resource scenarios, statistical n-gram language models outperform state-of-the-art neural models.
We introduce three methods to improve a neural model's performance in the low-resource setting.
arXiv Detail & Related papers (2022-05-10T11:19:56Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual data.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
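As a rough illustration of the character-level taggers described in the last entry above (Cross-lingual, Character-Level Neural Morphological Tagging), here is a minimal PyTorch sketch of a word-level tagger whose character embeddings and recurrent encoder are shared between a high-resource and a low-resource language; the architecture, dimensions, tag inventory, and training loop are simplified assumptions for illustration, not the paper's actual model.
```python
# A minimal sketch (assumptions throughout): a character-level BiLSTM that maps
# one word at a time to a single tag, with all parameters shared across
# languages so that high-resource training examples can benefit the
# low-resource language.
import torch
import torch.nn as nn

class SharedCharTagger(nn.Module):
    def __init__(self, n_chars: int, n_tags: int, emb_dim: int = 32, hid_dim: int = 64):
        super().__init__()
        # One character embedding table shared by all languages.
        self.char_emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hid_dim, n_tags)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_word_len) character indices, one word per row.
        emb = self.char_emb(char_ids)
        _, (h, _) = self.encoder(emb)
        # Concatenate the final forward and backward states as the word encoding.
        word_vec = torch.cat([h[0], h[1]], dim=-1)
        return self.classifier(word_vec)  # unnormalized tag scores

# Toy usage: words from both languages are indexed through the same character
# vocabulary, so every gradient step updates the shared encoder.
model = SharedCharTagger(n_chars=100, n_tags=12)
batch = torch.randint(1, 100, (4, 9))          # 4 words, up to 9 characters each
targets = torch.randint(0, 12, (4,))           # one (toy) tag per word
loss = nn.CrossEntropyLoss()(model(batch), targets)
loss.backward()
```
In the paper's setting the predicted label is a full morphological tag bundle rather than a single coarse tag, but the core idea illustrated here is the same: sharing character-level parameters across related languages is what enables the reported transfer from high- to low-resource languages.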