Related papers: Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios

Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios

URL: http://arxiv.org/abs/2409.12683v1
Date: Thu, 19 Sep 2024 11:48:42 GMT
Title: Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios
Authors: Aditya Joshi, Diptesh Kanojia, Heather Lent, Hour Kaing, Haiyue Song,
Abstract summary: Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in lower-resource' scenarios. This tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts.
Score: 11.460959151493055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in `lower-resource' scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts. By connecting past ideas to the present field, this tutorial aims to ignite collaboration and cross-pollination between researchers working in these scenarios. Our notion of `lower-resource' broadly denotes the outstanding lack of data required for model training - and may be applied to scenarios apart from the three covered in the tutorial.

Related papers

Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs [3.9530780161144667]
We investigate the multilingual mechanism by which Large Language Models internally operate in a latent space biased toward their inherently dominant language. We train models to generate the chain-of-thought (CoT) in English while outputting the final response in the target language, given input in the low-resource language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, with up to 28.33% improvement.
arXiv Detail & Related papers (2025-04-02T16:58:36Z)
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer. In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios [48.57072884674938]
We propose a method to analyze language similarity using deep learning. Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
arXiv Detail & Related papers (2020-12-01T22:44:42Z)
Low-Resource Adaptation of Neural NLP Models [0.30458514384586405]
This thesis investigates methods for dealing with low-resource scenarios in information extraction and natural language understanding. We develop and adapt neural NLP models to explore a number of research questions concerning NLP tasks with minimal or no training data.
arXiv Detail & Related papers (2020-11-09T12:13:55Z)
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus. Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.