A Digital Corpus of St. Lawrence Island Yupik
- URL: http://arxiv.org/abs/2101.10496v1
- Date: Tue, 26 Jan 2021 00:14:00 GMT
- Title: A Digital Corpus of St. Lawrence Island Yupik
- Authors: Lane Schwartz, Emily Chen, Hyunji Hayley Park, Edward Jahn, and Sylvia L.R. Schreiner
- Abstract summary: St. Lawrence Island Yupik is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka.
This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik.
- Score: 8.961418142411487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic
language in the Inuit-Yupik language family indigenous to Alaska and Chukotka.
This work presents a step-by-step pipeline for the digitization of written
texts, and the first publicly available digital corpus for St. Lawrence Island
Yupik, created using that pipeline. This corpus has great potential for future
linguistic inquiry and research in NLP. It was also developed for use in Yupik
language education and revitalization, with a primary goal of enabling easy
access to Yupik texts by educators and by members of the Yupik community. A
secondary goal is to support development of language technology such as
spell-checkers, text-completion systems, interactive e-books, and language
learning apps for use by the Yupik community.
Related papers
- ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts [0.0]
We present the development and deployment of a linguistic corpus from Twitter posts in English.
The main goal was to create a fully annotated English corpus for linguistic analysis.
We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams.
arXiv Detail & Related papers (2024-07-22T04:48:04Z)
- Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus [0.9051256541674136]
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus.
It is designed to bridge the technological gap in language learning and machine translation for under-resourced languages.
arXiv Detail & Related papers (2024-07-06T21:23:20Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
We enlist language educators from schools in North America to perform a manual evaluation; they find that the materials have potential for use in lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- Building an Endangered Language Resource in the Classroom: Universal Dependencies for Kakataibo [0.8938910048099864]
We launch a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru.
We first discuss the collaborative methodology implemented, which proved effective for creating a treebank in the context of a Computational Linguistics course for undergraduates.
arXiv Detail & Related papers (2022-06-21T12:58:56Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language [91.79339725967073]
More than 43% of the languages spoken in the world are endangered.
In this work, we focus on discussing how NLP can help revitalize endangered languages.
We take Cherokee, a severely-endangered Native American language, as a case study.
arXiv Detail & Related papers (2022-04-25T18:25:57Z)
- Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals.
This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)
- A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization [70.14668193220528]
In August 2019, a workshop was held at Carnegie Mellon University to attempt to bring together language community members, documentary linguists, and technologists.
This paper reports the results of the workshop, including issues discussed, and various conceived and implemented technologies for nine languages.
arXiv Detail & Related papers (2020-04-27T22:55:55Z)