Hire a Linguist!: Learning Endangered Languages with In-Context
Linguistic Descriptions
- URL: http://arxiv.org/abs/2402.18025v1
- Date: Wed, 28 Feb 2024 03:44:01 GMT
- Title: Hire a Linguist!: Learning Endangered Languages with In-Context
Linguistic Descriptions
- Authors: Kexun Zhang, Yee Man Choi, Zhenqiao Song, Taiqi He, William Yang Wang,
Lei Li
- Abstract summary: LINGOLLM is a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training.
We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages.
Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions.
- Score: 52.95579788485411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can large language models (LLMs) process and translate endangered
languages? Many languages lack a large corpus to train a decent LLM; therefore
existing LLMs rarely perform well in unseen, endangered languages. On the
contrary, we observe that 2000 endangered languages, though without a large
corpus, have a grammar book or a dictionary. We propose LINGOLLM, a
training-free approach to enable an LLM to process unseen languages that hardly
occur in its pre-training. Our key insight is to demonstrate linguistic
knowledge of an unseen language in an LLM's prompt, including a dictionary, a
grammar book, and morphologically analyzed input text. We implement LINGOLLM on
top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks
across 8 endangered or low-resource languages. Our results show that LINGOLLM
elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language
directions. Our findings demonstrate the tremendous value of linguistic
knowledge in the age of LLMs for endangered languages. Our data, code, and
model generations can be found at https://github.com/LLiLab/llm4endangeredlang.
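The abstract's central idea, putting a dictionary, grammar notes, and morphologically analyzed input into the prompt, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). The function name `build_prompt`, the prompt layout, and the toy vocabulary are all hypothetical and invented for illustration; the words do not come from any real language or from the paper's data.

```python
from typing import Dict, List


def build_prompt(
    sentence_tokens: List[str],
    dictionary: Dict[str, str],
    morph_analysis: Dict[str, str],
    grammar_excerpt: str,
) -> str:
    """Assemble a translation prompt from linguistic resources.

    Hypothetical helper: the layout is an assumption, not the paper's format.
    """
    gloss_lines = []
    for token in sentence_tokens:
        # Fall back to the raw token when no segmentation is available.
        analysis = morph_analysis.get(token, token)
        # Treat the first segment as the stem to look up in the dictionary.
        stem = analysis.split("-")[0]
        meaning = dictionary.get(stem, "<unknown>")
        gloss_lines.append(f"{token}\t{analysis}\t{meaning}")

    return "\n".join(
        [
            "Grammar notes:",
            grammar_excerpt,
            "",
            "Word-by-word analysis (token, segmentation, dictionary gloss):",
            *gloss_lines,
            "",
            "Source sentence: " + " ".join(sentence_tokens),
            "Translate the source sentence into English.",
        ]
    )


# Invented toy data, used only to show the shape of the inputs.
toy_dictionary = {"mira": "see", "tolu": "dog"}
toy_morph = {"mirana": "mira-na", "tolu": "tolu"}
toy_grammar = "Verbs take the suffix -na for past tense. (invented rule)"

prompt = build_prompt(["tolu", "mirana"], toy_dictionary, toy_morph, toy_grammar)
print(prompt)
```

The resulting string would then be sent to an LLM such as GPT-4 or Mixtral; that call is omitted here because it depends on the model provider.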
Related papers
- LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language [7.289015788793582]
This work focuses on increasing technological participation for the Sámi language.
We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages.
We have compiled the available Sámi language resources from the web to create a clean dataset for training language models.
arXiv Detail & Related papers (2024-05-09T13:54:22Z)
- Teaching Large Language Models an Unseen Language on the Fly [32.83773919852362]
We introduce DiPMT++, a framework for adapting LLMs to unseen languages by in-context learning.
Using a dictionary and 5K parallel sentences only, DiPMT++ significantly enhances the performance of GPT-4 from 0 to 16 BLEU for Chinese-to-Zhuang translation.
We also validate the effectiveness of our framework on Kalamang, another unseen language.
arXiv Detail & Related papers (2024-02-29T13:50:47Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge with the chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)