A Summary of the First Workshop on Language Technology for Language
Documentation and Revitalization
- URL: http://arxiv.org/abs/2004.13203v1
- Date: Mon, 27 Apr 2020 22:55:55 GMT
- Title: A Summary of the First Workshop on Language Technology for Language
Documentation and Revitalization
- Authors: Graham Neubig, Shruti Rijhwani, Alexis Palmer, Jordan MacKenzie,
Hilaria Cruz, Xinjian Li, Matthew Lee, Aditi Chaudhary, Luke Gessler, Steven
Abney, Shirley Anugrah Hayati, Antonios Anastasopoulos, Olga Zamaraeva, Emily
Prud'hommeaux, Jennette Child, Sara Child, Rebecca Knowles, Sarah Moeller,
Jeffrey Micher, Yiyuan Li, Sydney Zink, Mengzhou Xia, Roshan S Sharma and
Patrick Littell
- Abstract summary: In August 2019, a workshop was held at Carnegie Mellon University to attempt to bring together language community members, documentary linguists, and technologists.
This paper reports the results of the workshop, including issues discussed, and various conceived and implemented technologies for nine languages.
- Score: 70.14668193220528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in natural language processing and other language
technology, the application of such technology to language documentation and
conservation has been limited. In August 2019, a workshop was held at Carnegie
Mellon University in Pittsburgh to attempt to bring together language community
members, documentary linguists, and technologists to discuss how to bridge this
gap and create prototypes of novel and practical language revitalization
technologies. This paper reports the results of this workshop, including issues
discussed, and various conceived and implemented technologies for nine
languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida, Kwak'wala,
Ojibwe, San Juan Quiahije Chatino, and Seneca.
Related papers
- Shaping the Future of Endangered and Low-Resource Languages -- Our Role in the Age of LLMs: A Keynote at ECIR 2024 [3.2362171533623054]
Isidore of Seville is credited with the adage that it is language that gives birth to a people, and not the other way around.
Today, of the more than 7100 languages listed, a significant number are endangered.
arXiv Detail & Related papers (2024-09-05T06:54:30Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal [0.0]
The Kallaama project aims to produce and disseminate national-language corpora for the development of speech technologies.
The project focuses on the three languages most widely spoken by Senegalese people: Wolof, Pulaar and Sereer.
We release a transcribed speech dataset containing 125 hours of recordings, about agriculture, in each of the above-mentioned languages.
arXiv Detail & Related papers (2024-04-02T14:31:14Z)
- Building a Language-Learning Game for Brazilian Indigenous Languages: A Case of Study [0.0]
We describe a process to automatically generate language exercises and questions from a dependency treebank and a lexical database for Tupian languages.
We conclude that new data gathering processes should be established in partnership with indigenous communities and oriented for educational purposes.
arXiv Detail & Related papers (2024-03-21T16:11:44Z)
- "It's how you do things that matters": Attending to Process to Better Serve Indigenous Communities with Language Technologies [2.821682550792172]
This position paper explores ethical considerations in building NLP technologies for Indigenous languages.
We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities.
We recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities.
arXiv Detail & Related papers (2024-02-04T23:23:51Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language [91.79339725967073]
More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
arXiv Detail & Related papers (2022-04-25T18:25:57Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- A Digital Corpus of St. Lawrence Island Yupik [8.961418142411487]
St. Lawrence Island Yupik is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka.
This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik.
arXiv Detail & Related papers (2021-01-26T00:14:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.