Embedded Translations for Low-resource Automated Glossing
- URL: http://arxiv.org/abs/2403.08189v1
- Date: Wed, 13 Mar 2024 02:23:13 GMT
- Title: Embedded Translations for Low-resource Automated Glossing
- Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg
- Abstract summary: We augment a hard-attentional neural model with embedded translation information extracted from interlinear glossed text.
We introduce a character-level decoder for generating glossed output.
Our results highlight the critical role of translation information in boosting the system's performance.
- Score: 11.964276799347642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate automatic interlinear glossing in low-resource settings. We
augment a hard-attentional neural model with embedded translation information
extracted from interlinear glossed text. After encoding these translations
using large language models, specifically BERT and T5, we introduce a
character-level decoder for generating glossed output. Aided by these
enhancements, our model demonstrates an average improvement of 3.97 percentage points
over the previous state of the art on datasets from the SIGMORPHON 2023 Shared
Task on Interlinear Glossing. In a simulated ultra low-resource setting,
trained on as few as 100 sentences, our system achieves an average 9.78 percentage-point
improvement over the plain hard-attentional baseline. These results highlight
the critical role of translation information in boosting the system's
performance, especially in processing and interpreting modest data sources. Our
findings suggest a promising avenue for the documentation and preservation of
languages, with our experiments on shared task datasets indicating significant
advancements over the existing state of the art.
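The abstract describes the architecture only at a high level. As an illustration of the general idea rather than the authors' implementation, the sketch below conditions a character-level gloss decoder on a translation embedding produced by a pretrained encoder such as BERT; the class name, layer sizes, mean pooling, and the LSTM stand-in for the hard-attentional decoder are all assumptions.

    # Illustrative sketch only (not the authors' code): condition a character-level
    # gloss decoder on an embedded free translation from a pretrained encoder.
    # Requires `torch` and `transformers`; names and sizes are hypothetical.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class TranslationConditionedGlosser(nn.Module):
        def __init__(self, char_vocab_size, hidden_size=256,
                     encoder_name="bert-base-multilingual-cased"):
            super().__init__()
            self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
            self.translation_encoder = AutoModel.from_pretrained(encoder_name)
            enc_dim = self.translation_encoder.config.hidden_size
            self.char_embedding = nn.Embedding(char_vocab_size, hidden_size)
            self.state_proj = nn.Linear(enc_dim, hidden_size)
            # Stand-in decoder; the paper uses a hard-attentional model instead.
            self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
            self.output = nn.Linear(hidden_size, char_vocab_size)

        def forward(self, translation, gloss_char_ids):
            # translation: a single free-translation string
            # gloss_char_ids: (1, T) tensor of character ids for the gloss prefix
            inputs = self.tokenizer(translation, return_tensors="pt")
            with torch.no_grad():  # keep the pretrained encoder frozen in this sketch
                encoded = self.translation_encoder(**inputs).last_hidden_state
            pooled = encoded.mean(dim=1)                       # (1, enc_dim)
            h0 = torch.tanh(self.state_proj(pooled)).unsqueeze(0)
            c0 = torch.zeros_like(h0)
            chars = self.char_embedding(gloss_char_ids)        # (1, T, hidden)
            states, _ = self.decoder(chars, (h0, c0))
            return self.output(states)                         # (1, T, vocab) logits

In the paper the translation representation would feed into the hard-attentional glossing model itself; the frozen, mean-pooled encoder above is simply the shortest way to show the conditioning step.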
Related papers
- Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing [10.6453235045045]
We address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise.
Our enhancements lead to an average absolute improvement of 5 percentage points in word-level accuracy over the previous state of the art.
arXiv Detail & Related papers (2024-06-16T22:01:15Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora (a simplified sketch of such a metric follows below).
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
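For context on how precision and recall can be applied to generated text, the sketch below estimates both from sentence embeddings using k-nearest-neighbour support estimation; this is a common embedding-space formulation, not necessarily the exact one used in the paper above, and the random arrays stand in for real embeddings.

    # Hedged sketch: embedding-space precision/recall via k-NN support estimation.
    # Not necessarily the paper's formulation; the arrays below are placeholders.
    import numpy as np

    def knn_radius(points, k=3):
        # Distance from each point to its k-th nearest neighbour within the same set.
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        dists.sort(axis=1)
        return dists[:, k]  # index 0 is the point itself (distance 0)

    def coverage(queries, support_points, radii):
        # Fraction of query points lying inside any support ball.
        dists = np.linalg.norm(queries[:, None, :] - support_points[None, :, :], axis=-1)
        return float(np.mean((dists <= radii[None, :]).any(axis=1)))

    real = np.random.randn(500, 16)  # embeddings of reference text (placeholder)
    gen = np.random.randn(500, 16)   # embeddings of generated text (placeholder)
    precision = coverage(gen, real, knn_radius(real))  # generated text that looks real
    recall = coverage(real, gen, knn_radius(gen))      # real text the model can cover
    print(f"precision={precision:.3f} recall={recall:.3f}")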
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that covers adequate variants of literal expression under the same meaning (a simplified sketch follows below).
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
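A drastically simplified sketch of the adjacency-region idea follows, assuming source- and target-side sentence embeddings are already available; the interpolation-plus-noise sampling, the radius, and all names are assumptions rather than the paper's actual procedure.

    # Simplified sketch of sampling augmented representations from an "adjacency
    # region" around a pair of semantically equivalent sentence embeddings.
    # Not the paper's actual procedure; radius and interpolation are assumptions.
    import torch

    def sample_adjacent(src_vec: torch.Tensor, tgt_vec: torch.Tensor,
                        n_samples: int = 4, radius: float = 0.1) -> torch.Tensor:
        lam = torch.rand(n_samples, 1)                  # interpolation coefficients
        base = lam * src_vec + (1.0 - lam) * tgt_vec    # points between the two meanings
        noise = radius * torch.randn_like(base)         # small semantic perturbation
        return base + noise                             # (n_samples, embedding_dim)

    # Example: augment one training pair whose sentence embeddings are 512-d.
    src_emb, tgt_emb = torch.randn(512), torch.randn(512)
    augmented = sample_adjacent(src_emb, tgt_emb)       # 4 extra "virtual" instances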
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods for BERT, a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters (a sketch of layer freezing follows below).
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
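A minimal sketch of the layer-freezing idea, assuming a HuggingFace BERT checkpoint; the cut-off layer and the sequence-classification head are illustrative choices, not details taken from the paper.

    # Sketch: freeze the embeddings and the lowest encoder layers of BERT so that
    # only the upper layers (and the task head) remain trainable during fine-tuning.
    # Checkpoint name and cut-off layer are illustrative choices, not the paper's.
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    freeze_up_to = 8  # freeze embeddings and encoder layers 0..7
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:freeze_up_to]:
        for param in layer.parameters():
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")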
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on linearized labeled sentences (see the sketch below).
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
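As an illustration of the linearization trick behind DAGA-style augmentation, the sketch below interleaves tag tokens with words so that an ordinary language model can be trained on, and later sample, labeled sentences; the tag-placement convention and tagset are assumptions.

    # Sketch of DAGA-style linearization: emit a tag token before each word that
    # carries a non-O label, so a plain language model can learn labeled sentences.
    # Tag placement convention here is an assumption, not the paper's exact scheme.
    TAGSET = {"B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"}

    def linearize(tokens, tags):
        out = []
        for token, tag in zip(tokens, tags):
            if tag != "O":
                out.append(tag)      # e.g. "B-PER" emitted right before its word
            out.append(token)
        return " ".join(out)

    def delinearize(text):
        tokens, tags, pending = [], [], "O"
        for piece in text.split():
            if piece in TAGSET:
                pending = piece
            else:
                tokens.append(piece)
                tags.append(pending)
                pending = "O"
        return tokens, tags

    print(linearize(["John", "lives", "in", "Oslo"], ["B-PER", "O", "O", "B-LOC"]))
    # -> "B-PER John lives in B-LOC Oslo"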
This list is automatically generated from the titles and abstracts of the papers on this site.