Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for
Underdocumented Languages
- URL: http://arxiv.org/abs/2203.09632v1
- Date: Thu, 17 Mar 2022 22:02:25 GMT
- Title: Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for
Underdocumented Languages
- Authors: Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang,
Edith Coates, Garrett Nicolai and Miikka Silfverberg
- Abstract summary: Hundreds of underserved languages have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts.
We make the case that IGT data can be leveraged successfully provided that target language expertise is available.
We illustrate each step through a case study on developing a morphological reinflection system for the Tsimshianic language Gitksan.
- Score: 6.8708103492634836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in NLP is driven by pretrained models leveraging massive
datasets and has predominantly benefited the world's political and economic
superpowers. Technologically underserved languages are left behind because they
lack such resources. Hundreds of underserved languages, nevertheless, have
available data sources in the form of interlinear glossed text (IGT) from
language documentation efforts. IGT remains underutilized in NLP work, perhaps
because its annotations are only semi-structured and often language-specific.
With this paper, we make the case that IGT data can be leveraged successfully
provided that target language expertise is available. We specifically advocate
for collaboration with documentary linguists. Our paper provides a roadmap for
successful projects utilizing IGT data: (1) It is essential to define which NLP
tasks can be accomplished with the given IGT data and how these will benefit
the speech community. (2) Great care and target language expertise are required
when converting the data into structured formats commonly employed in NLP. (3)
Task-specific and user-specific evaluation can help to ascertain that the tools
which are created benefit the target language speech community. We illustrate
each step through a case study on developing a morphological reinflection
system for the Tsimshianic language Gitksan.
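As a concrete illustration of step (2), the sketch below converts one simplified IGT record into the (lemma, feature-tag, surface-form) triples commonly used for morphological reinflection. The tier names and the toy example are invented rather than taken from the Gitksan data, and a real conversion would need exactly the kind of target-language expertise the paper argues for.

```python
# A minimal sketch of step (2), assuming a simplified IGT record with
# transcription, morpheme segmentation, and gloss tiers: each word is turned
# into a (stem, feature tags, surface form) triple of the kind used for
# morphological reinflection. Field names and the toy example are invented,
# not the paper's actual Gitksan pipeline.

from typing import Dict, List, Tuple


def igt_to_reinflection_triples(igt: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Pair each surface word with its segmentation and gloss, then split off
    the stem (as a stand-in lemma) and the feature tags."""
    forms = igt["transcription"].split()
    segmentations = igt["segmentation"].split()
    glosses = igt["gloss"].split()
    triples = []
    for form, seg, gloss in zip(forms, segmentations, glosses):
        stem = seg.split("-")[0]                              # target-language stem
        features = ";".join(gloss.split("-")[1:]) or "BARE"   # gloss tail = tags
        triples.append((stem, features, form))
    return triples


if __name__ == "__main__":
    # Toy record in an invented language (deliberately not real Gitksan data).
    example = {
        "transcription": "talani kusu",
        "segmentation": "tala-ni kusu",
        "gloss": "run-PST dog",
        "translation": "the dog ran",
    }
    for stem, feats, form in igt_to_reinflection_triples(example):
        print(f"{stem}\t{feats}\t{form}")
```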
Related papers
- Can we teach language models to gloss endangered languages? [10.698704803396723]
Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation.
We explore whether large language models (LLMs) can be effective at the task of interlinear glossing with in-context learning, without any traditional training.
We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all.
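For the in-context-learning setup described here, a few-shot prompt is one plausible sketch: a handful of already-glossed lines followed by the line to be glossed. The prompt wording, field labels, and toy records below are assumptions, not the paper's exact prompt.

```python
# A minimal sketch of in-context learning for interlinear glossing: glossed
# example lines plus one unglossed target line are packed into a prompt string
# that would be sent to whatever LLM is being evaluated. The instruction text
# and toy records are hypothetical.

from typing import Dict, List


def build_glossing_prompt(examples: List[Dict[str, str]], target: str) -> str:
    header = ("Gloss each morpheme of the transcription, "
              "following the pattern in the examples.\n\n")
    shots = []
    for ex in examples:
        shots.append(
            f"Transcription: {ex['transcription']}\n"
            f"Gloss: {ex['gloss']}\n"
            f"Translation: {ex['translation']}\n"
        )
    query = f"Transcription: {target}\nGloss:"
    return header + "\n".join(shots) + "\n" + query


if __name__ == "__main__":
    few_shot = [
        {"transcription": "talani kusu", "gloss": "run-PST dog",
         "translation": "the dog ran"},
        {"transcription": "talawi kusu", "gloss": "run-FUT dog",
         "translation": "the dog will run"},
    ]
    print(build_glossing_prompt(few_shot, "talawi nuka"))
```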
arXiv Detail & Related papers (2024-06-27T05:17:04Z)
- Toward Informal Language Processing: Knowledge of Slang in Large Language Models [16.42982896928428]
We construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang.
For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications.
We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance.
arXiv Detail & Related papers (2024-04-02T21:50:18Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text [39.846419973203744]
We compile the largest existing corpus of interlinear glossed text (IGT) data from a variety of sources, covering over 450k examples across 1.8k languages.
We normalize much of our data to follow a standard set of labels across languages.
As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus.
We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%.
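The label normalization mentioned above can be pictured as a mapping from project-specific gloss tags onto one shared tag set; the entries below are purely illustrative and not the corpus's actual scheme.

```python
# A minimal sketch of gloss-label normalization: project-specific tags are
# rewritten into one shared tag set so IGT from different sources can be
# pooled for pretraining. The mapping is illustrative only, not the actual
# label scheme used for the corpus.

ILLUSTRATIVE_TAG_MAP = {
    "PAST": "PST",
    "PRET": "PST",
    "1SG": "1;SG",
    "PL.": "PL",
}


def normalize_gloss(gloss_line: str, tag_map=None) -> str:
    """Rewrite each hyphen-separated tag of a glossed line via the shared map."""
    tag_map = tag_map or ILLUSTRATIVE_TAG_MAP
    words = []
    for word in gloss_line.split():
        words.append("-".join(tag_map.get(tag, tag) for tag in word.split("-")))
    return " ".join(words)


print(normalize_gloss("run-PRET dog-PL."))  # -> "run-PST dog-PL"
```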
arXiv Detail & Related papers (2024-03-11T03:21:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
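One common proxy for the lexical-diversity comparison mentioned here is a type-token ratio; the snippet below is only a sketch of that comparison, with placeholder text, and the paper's actual diversity measure may differ.

```python
# A small sketch of comparing corpus construction methods by lexical
# diversity, using type-token ratio as one common proxy. The texts are
# placeholders and the paper's actual metric may differ.

def type_token_ratio(texts):
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


corpora = {
    "online_scraping": ["contoh kalimat contoh kalimat contoh"],
    "paragraph_writing": ["kalimat panjang yang ditulis oleh penutur asli"],
}
for method, texts in corpora.items():
    print(method, round(type_token_ratio(texts), 3))
```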
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
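Lexicon-based adaptation can be sketched, under heavy simplification, as word-for-word substitution through a bilingual lexicon to create synthetic target-language text; the toy lexicon and target language below are invented, and the paper's full method goes beyond this single step.

```python
# A minimal sketch of lexicon-based adaptation: high-resource text is turned
# into synthetic target-language text by word-for-word substitution through a
# bilingual lexicon, so a pretrained model sees target-language tokens during
# adaptation. The lexicon and target language are invented.

TOY_LEXICON = {  # English -> invented target language
    "the": "na",
    "dog": "kusu",
    "ran": "talani",
}


def lexicon_translate(sentence: str, lexicon=None) -> str:
    """Replace every word that has a lexicon entry; keep unknown words as-is."""
    lexicon = lexicon or TOY_LEXICON
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


print(lexicon_translate("The dog ran home"))  # -> "na kusu talani home"
```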
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- LaoPLM: Pre-trained Language Models for Lao [3.2146309563776416]
Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PLMs have been widely used in most NLP applications, they remain under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scarce situation of the Lao language.
We present the first transformer-based PLMs for Lao in four versions: BERT-small, BERT-base, ELECTRA-small and ELECTRA-base, and evaluate them on two downstream tasks: part-of-speech tagging and text classification.
arXiv Detail & Related papers (2021-10-12T11:13:07Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
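The shared-model multi-task framework described here can be pictured with a small PyTorch sketch: one shared encoder feeding task-specific heads. The toy encoder, layer sizes, and task names are placeholders, not N-LTP's actual architecture.

```python
# A minimal PyTorch sketch of a multi-task setup with one shared encoder and
# per-task classification heads. The toy GRU encoder, layer sizes, and task
# names are placeholders, not N-LTP's actual architecture.

import torch
import torch.nn as nn


class SharedEncoderMultiTask(nn.Module):
    def __init__(self, vocab_size=1000, hidden=128, task_label_sizes=None):
        super().__init__()
        task_label_sizes = task_label_sizes or {"pos": 20, "ner": 9}
        # Stand-in for a shared pre-trained encoder.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # One lightweight head per task, all reading the shared states.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n_labels)
             for task, n_labels in task_label_sizes.items()}
        )

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # per-token logits for the chosen task


model = SharedEncoderMultiTask()
tokens = torch.randint(0, 1000, (2, 5))  # batch of 2 toy sentences, length 5
print(model(tokens, "pos").shape)        # torch.Size([2, 5, 20])
```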
arXiv Detail & Related papers (2020-09-24T11:45:39Z)