Shaping the Future of Endangered and Low-Resource Languages -- Our Role in the Age of LLMs: A Keynote at ECIR 2024
- URL: http://arxiv.org/abs/2409.13702v1
- Date: Thu, 5 Sep 2024 06:54:30 GMT
- Title: Shaping the Future of Endangered and Low-Resource Languages -- Our Role in the Age of LLMs: A Keynote at ECIR 2024
- Authors: Josiane Mothe,
- Abstract summary: Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around.
Today, of the more than 7100 languages listed, a significant number are endangered.
- Score: 3.2362171533623054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around , underlining the profound role played by language in the formation of cultural and social identity. Today, of the more than 7100 languages listed, a significant number are endangered. Since the 1970s, linguists, information seekers and enthusiasts have helped develop digital resources and automatic tools to support a wide range of languages, including endangered ones. The advent of Large Language Model (LLM) technologies holds both promise and peril. They offer unprecedented possibilities for the translation and generation of content and resources, key elements in the preservation and revitalisation of languages. They also present threat of homogenisation, cultural oversimplification and the further marginalisation of already vulnerable languages. The talk this paper is based on has proposed an initiatory journey, exploring the potential paths and partnerships between technology and tradition, with a particular focus on the Occitan language. Occitan is a language from Southern France, parts of Spain and Italy that played a major cultural and economic role, particularly in the Middle Ages. It is now endangered according to UNESCO. The talk critically has examined how human expertise and artificial intelligence can work together to offer hope for preserving the linguistic diversity that forms the foundation of our global and especially our European heritage while addressing some of the ethical and practical challenges that accompany the use of these powerful technologies. This paper is based on the keynote I gave at the 46th European Conference on Information Retrieval (ECIR 2024). As an alternative to reading this paper, a video talk is available online. 1 Date: 26 March 2024.
Related papers
- Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce [27.918975040084387]
Data in a given language should be viewed as more than a collection of tokens.
Good data collection and labeling practices are key to building more human-centered and socially aware technologies.
arXiv Detail & Related papers (2024-10-16T15:51:18Z) - Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs)
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z) - Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
arXiv Detail & Related papers (2024-09-08T23:51:04Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences [31.62071644137294]
We discuss the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP.
We report encouraging results in the development of high-quality machine learning translators for Indigenous languages.
We present prototypes we have built in projects done in 2023 and 2024 with Indigenous communities in Brazil, aimed at facilitating writing.
arXiv Detail & Related papers (2024-07-17T14:46:37Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - Language Varieties of Italy: Technology Challenges and Opportunities [4.199528104335137]
Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe.
Most local languages and dialects in Italy are at risk of disappearing within few generations.
arXiv Detail & Related papers (2022-09-20T14:39:12Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - Not always about you: Prioritizing community needs when developing
endangered language technology [5.670857685983896]
We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face.
We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
arXiv Detail & Related papers (2022-04-12T05:59:39Z) - A Summary of the First Workshop on Language Technology for Language
Documentation and Revitalization [70.14668193220528]
In August 2019, a workshop was held at Carnegie Mellon University to attempt to bring together language community members, documentary linguists, and technologists.
This paper reports the results of the workshop, including issues discussed, and various conceived and implemented technologies for nine languages.
arXiv Detail & Related papers (2020-04-27T22:55:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.