High-quality Data-to-Text Generation for Severely Under-Resourced
Languages with Out-of-the-box Large Language Models
- URL: http://arxiv.org/abs/2402.12267v1
- Date: Mon, 19 Feb 2024 16:29:40 GMT
- Title: High-quality Data-to-Text Generation for Severely Under-Resourced
Languages with Out-of-the-box Large Language Models
- Authors: Michela Lorandi and Anya Belz
- Abstract summary: We explore the extent to which pretrained large language models (LLMs) can bridge the performance gap for under-resourced languages.
We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins.
For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English.
- Score: 5.632410663467911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of NLP methods for severely under-resourced languages cannot
currently hope to match the state of the art in NLP methods for well resourced
languages. We explore the extent to which pretrained large language models
(LLMs) can bridge this gap, via the example of data-to-text generation for
Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced
languages and English, in a range of scenarios. We find that LLMs easily set
the state of the art for the under-resourced languages by substantial margins,
as measured by both automatic and human evaluations. For all our languages,
human evaluation shows on-a-par performance with humans for our best systems,
but BLEU scores collapse compared to English, casting doubt on the metric's
suitability for evaluating non-task-specific systems. Overall, our results
demonstrate the great potential of LLMs to bridge the performance gap for
under-resourced languages.
Related papers
- On Limitations of LLM as Annotator for Low Resource Languages [0.4194295877935868]
Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification.
This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection.
To bridge this gap, Large Language Models (LLMs) could serve as annotators, capable of generating datasets and resources for these underrepresented languages.
arXiv Detail & Related papers (2024-11-26T17:55:37Z)
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- LLMs for Extremely Low-Resource Finno-Ugric Languages [0.8192907805418583]
This paper addresses the gap by focusing on Võro, Livonian, and Komi.
We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation.
We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
arXiv Detail & Related papers (2024-10-24T16:48:12Z)
- Generating bilingual example sentences with large language models as lexicography assistants [2.6550899846546527]
We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels.
We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility.
arXiv Detail & Related papers (2024-10-04T06:45:48Z)
- Quantifying Multilingual Performance of Large Language Models Across Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.
We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.
Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Although large language models (LLMs) have been pre-trained on multilingual corpora, their performance in most languages still lags behind their performance in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to perform tasks effectively after observing only a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)