Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing
- URL: http://arxiv.org/abs/2406.11085v1
- Date: Sun, 16 Jun 2024 22:01:15 GMT
- Title: Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing
- Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg
- Abstract summary: We address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise.
Our enhancements lead to an average absolute improvement of 5%-points in word-level accuracy over the previous state-of-the-art.
- Score: 10.6453235045045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We supplement models with translations at both the token and sentence level as well as leverage the extensive linguistic capability of modern LLMs. Our enhancements lead to an average absolute improvement of 5%-points in word-level accuracy over the previous state of the art on a typologically diverse dataset spanning six low-resource languages. The improvements are particularly noticeable for the lowest-resourced language Gitksan, where we achieve a 10%-point improvement. Furthermore, in a simulated ultra-low resource setting for the same six languages, training on fewer than 100 glossed sentences, we establish an average 10%-point improvement in word-level accuracy over the previous state-of-the-art system.
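The abstract describes the approach only at a high level: a glossing model is supplemented with translations at both the token and the sentence level, with LLMs as one source of that information. As a rough, non-authoritative illustration of what such an augmented input might look like, the sketch below assembles source tokens, per-token translation candidates, and a free sentence translation into a single string; the function name, separator, `[TRANS]` marker, and example data are all invented for illustration and are not the authors' implementation.

```python
# Illustrative sketch only: one way token- and sentence-level translations
# could be combined with source tokens before feeding them to a glossing model.
# All names and data here are hypothetical, not taken from the paper.
from typing import Optional


def build_glossing_input(
    tokens: list[str],
    token_translations: list[Optional[str]],
    sentence_translation: str,
    sep: str = "#",
) -> str:
    """Pair each source token with a candidate translation (e.g. from a
    bilingual lexicon or an LLM) and append the free sentence translation."""
    assert len(tokens) == len(token_translations)
    augmented = [
        f"{tok}{sep}{trans}" if trans else tok
        for tok, trans in zip(tokens, token_translations)
    ]
    # The sentence-level translation gives the model global context.
    return " ".join(augmented) + " [TRANS] " + sentence_translation


# Toy example (invented data, not drawn from the glossing corpora in the paper):
source = ["los", "gatos", "duermen"]
token_trans = ["the", "cats", "sleep"]
sent_trans = "The cats are sleeping."
print(build_glossing_input(source, token_trans, sent_trans))
# los#the gatos#cats duermen#sleep [TRANS] The cats are sleeping.
```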
Related papers
- Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition [2.7247388777405597]
We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets (a minimal sketch of such a weighted loss appears after this list).
We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language.
arXiv Detail & Related papers (2024-09-25T14:09:09Z)
- Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields [68.17213992395041]
Low-resource named entity recognition is still an open problem in NLP.
We present a transfer learning scheme, whereby we train character-level neural CRFs to predict named entities for both high-resource and low-resource languages jointly.
arXiv Detail & Related papers (2024-04-14T23:44:49Z)
- Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study [1.6819960041696331]
In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian.
Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance.
Statistical significance tests with Bonferroni correction show surprisingly strong baseline systems, and that back-translation leads to significant improvements.
arXiv Detail & Related papers (2024-04-12T06:16:26Z)
- Embedded Translations for Low-resource Automated Glossing [11.964276799347642]
We augment a hard-attentional neural model with embedded translation information extracted from interlinear glossed text.
We introduce a character-level decoder for generating glossed output.
Our results highlight the critical role of translation information in boosting the system's performance.
arXiv Detail & Related papers (2024-03-13T02:23:13Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators.
We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
- Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages [24.775371434410328]
We explore techniques exploiting the qualities of morphologically rich languages (MRLs).
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
arXiv Detail & Related papers (2020-03-09T21:30:55Z)
- Improving Candidate Generation for Low-resource Cross-lingual Entity Linking [81.41804263432684]
Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts.
In this paper, we propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios.
arXiv Detail & Related papers (2020-03-03T05:32:09Z)
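Referring back to the weighted cross-entropy entry near the top of this list: as a minimal, assumption-laden sketch of what such a reweighted loss looks like in practice, the snippet below uses PyTorch's built-in class weighting. The weights, class count, and batch are invented, and the cited paper's actual weighting scheme may differ in detail.

```python
# Minimal sketch of a weighted cross-entropy loss (not the cited paper's code).
import torch
import torch.nn as nn

num_classes = 5
# Hypothetical weights: up-weight the rare (low-resource) class so its errors
# contribute more to the loss than errors on the frequent classes.
class_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 4.0])

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, num_classes)           # a batch of 8 model predictions
targets = torch.randint(0, num_classes, (8,))  # gold labels for the batch
loss = loss_fn(logits, targets)
print(loss.item())
```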