Quantifying the Dialect Gap and its Correlates Across Languages
- URL: http://arxiv.org/abs/2310.15135v1
- Date: Mon, 23 Oct 2023 17:42:01 GMT
- Title: Quantifying the Dialect Gap and its Correlates Across Languages
- Authors: Anjali Kantharuban, Ivan Vulić, and Anna Korhonen
- Abstract summary: This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Historically, researchers and consumers have noticed a decrease in quality when applying NLP tools to minority variants of languages (e.g., Puerto Rican Spanish or Swiss German), but studies exploring this have been limited to a select few languages. Additionally, past studies have mainly been conducted in a monolingual context, so cross-linguistic trends have not been identified and tied to external factors. In this work, we conduct a comprehensive evaluation of the most influential, state-of-the-art large language models (LLMs) across two high-use applications, machine translation and automatic speech recognition, to assess their performance on the regional dialects of several high- and low-resource languages. Additionally, we analyze how the regional dialect gap correlates with economic, social, and linguistic factors. The impact of training data, including related factors like dataset size and its construction procedure, is shown to be significant but not consistent across models or languages, meaning a one-size-fits-all approach cannot be taken to close the dialect gap. This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
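To make the correlation analysis concrete, here is a minimal sketch, not taken from the paper, of one way to quantify a dialect gap as the relative performance drop between a standard variety and a regional dialect and then test its association with an external correlate. All language names, scores, and factor values are invented placeholders.

```python
# Minimal sketch (not the paper's code): quantify a dialect gap and test its
# association with an external factor via rank correlation.
# All scores and factor values below are invented placeholders.
from scipy.stats import spearmanr

# Hypothetical system scores where higher is better
# (e.g., BLEU for machine translation, or 100 - WER for speech recognition).
standard_scores = {"spanish": 42.0, "german": 39.5, "arabic": 28.0, "tamil": 18.5}
dialect_scores = {"spanish": 35.5, "german": 27.0, "arabic": 19.5, "tamil": 12.0}

# Hypothetical external correlate per language (e.g., an economic index
# for the dialect-speaking population).
external_factor = {"spanish": 0.71, "german": 0.88, "arabic": 0.42, "tamil": 0.30}

langs = sorted(standard_scores)
# Relative gap: fraction of standard-variety performance lost on the dialect.
gaps = [(standard_scores[l] - dialect_scores[l]) / standard_scores[l] for l in langs]
factors = [external_factor[l] for l in langs]

rho, p_value = spearmanr(gaps, factors)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```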
Related papers
- Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios
Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in 'lower-resource' scenarios.
This tutorial will identify common challenges, approaches, and themes in natural language processing (NLP) research for confronting and overcoming the obstacles inherent to data-poor contexts.
arXiv Detail & Related papers (2024-09-19T11:48:42Z)
- Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges
We critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic).
We outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection.
We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.
arXiv Detail & Related papers (2024-07-04T15:38:38Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Natural Language Processing for Dialects of a Language: A Survey
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
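As context for the lexical-diversity comparison, here is a minimal sketch of one common diversity measure, the type-token ratio. This is our own illustration of the concept, not the NusaWrites evaluation code, and the two sample texts are invented.

```python
# Minimal sketch of one common lexical-diversity measure (type-token ratio).
# This illustrates the concept only; it is not NusaWrites' actual metric,
# and the sample texts are invented.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

scraped = "the market the market sells fruit and fruit and fruit"
written = "vendors haggle over ripe mangoes while children chase kites nearby"
print(type_token_ratio(scraped))  # repetitive text -> lower ratio (0.5)
print(type_token_ratio(written))  # varied text -> higher ratio (1.0)
```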
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently.
This paper provides a comprehensive analysis of the disparity that exists within the languages of the world.
arXiv Detail & Related papers (2022-10-16T12:50:30Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples
We present AM2iCo (Adversarial and Multilingual Meaning in Context).
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
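For illustration, a word-in-context item in the spirit of AM2iCo can be represented as a pair of contexts with a binary same-meaning label. The data structure and example below are our own sketch, not AM2iCo's actual data format.

```python
# Illustrative sketch of a cross-lingual word-in-context item: decide whether
# the marked target words carry the same meaning in the two contexts.
# This data structure and example are our own, not AM2iCo's format.
from dataclasses import dataclass

@dataclass
class WicPair:
    context1: str   # context in language 1, target word marked with <w>...</w>
    context2: str   # context in language 2, target word marked with <w>...</w>
    label: bool     # True if the target words mean the same thing

pair = WicPair(
    context1="She deposited the check at the <w>bank</w>.",   # financial sense
    context2="Wir saßen am <w>Ufer</w> des Flusses.",          # river-bank sense
    label=False,
)
print(pair.label)  # False: the two senses differ
```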
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
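To make the task format concrete: XCOPA follows the COPA setup, where a model must pick the more plausible cause or effect of a premise. Below is a minimal sketch of that format and a simple accuracy computation; the data structures and scoring helper are our own illustration, not XCOPA's official tooling, and the sample item is a classic English COPA-style example.

```python
# Minimal sketch of the COPA-style format underlying XCOPA: given a premise
# and a question type ("cause" or "effect"), pick the more plausible of two
# alternatives. Illustrative only, not XCOPA's official tooling.
from dataclasses import dataclass

@dataclass
class CopaItem:
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # 0 -> choice1 is correct, 1 -> choice2 is correct

items = [
    CopaItem(
        premise="The pond froze over for the winter.",
        choice1="People skated on the pond.",
        choice2="People brought boats to the pond.",
        question="effect",
        label=0,
    ),
]

def accuracy(predictions: list[int], gold: list[CopaItem]) -> float:
    # Fraction of items where the predicted choice index matches the label.
    return sum(p == item.label for p, item in zip(predictions, gold)) / len(gold)

print(accuracy([0], items))  # 1.0
```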