Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
- URL: http://arxiv.org/abs/2501.09943v2
- Date: Fri, 07 Feb 2025 18:35:54 GMT
- Title: Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
- Authors: Belu Ticona, Fernando Carranza, Viviana Cotik
- Abstract summary: Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages.
We present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families.
For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census.
- Score: 45.07333085270152
- License:
- Abstract: Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.
Related papers
- Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages [34.78841410279943]
Endangered languages, such as Navajo, are significantly underrepresented in contemporary language technologies.
This study evaluates Google's Language Identification (LangID) tool, which does not currently support any Native American languages.
arXiv Detail & Related papers (2025-01-27T04:43:18Z)
- Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects [0.0]
We aim to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family.
Our approach is motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine.
arXiv Detail & Related papers (2024-12-09T22:47:41Z)
- Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars [2.2061683015812026]
We develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico.
The datasets are parallel with Spanish, the dominant language of the region.
We present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets.
arXiv Detail & Related papers (2024-04-11T12:09:47Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)