Multi-VALUE: A Framework for Cross-Dialectal English NLP
- URL: http://arxiv.org/abs/2212.08011v3
- Date: Tue, 30 May 2023 02:35:29 GMT
- Title: Multi-VALUE: A Framework for Cross-Dialectal English NLP
- Authors: Caleb Ziems, William Held, Jingfeng Yang, Jwala Dhamala, Rahul Gupta,
Diyi Yang
- Abstract summary: Multi- Dialect is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
- Score: 49.55176102659081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dialect differences caused by regional, social, and economic factors cause
performance discrepancies for many groups of language technology users.
Inclusive and equitable language technology must critically be dialect
invariant, meaning that performance remains constant over dialectal shifts.
Current systems often fall short of this ideal since they are designed and
tested on a single dialect: Standard American English (SAE). We introduce a
suite of resources for evaluating and achieving English dialect invariance. The
resource is called Multi-VALUE, a controllable rule-based translation system
spanning 50 English dialects and 189 unique linguistic features. Multi-VALUE
maps SAE to synthetic forms of each dialect. First, we use this system to
stress tests question answering, machine translation, and semantic parsing.
Stress tests reveal significant performance disparities for leading models on
non-standard dialects. Second, we use this system as a data augmentation
technique to improve the dialect robustness of existing systems. Finally, we
partner with native speakers of Chicano and Indian English to release new
gold-standard variants of the popular CoQA task. To execute the transformation
code, run model checkpoints, and download both synthetic and gold-standard
dialectal benchmark datasets, see http://value-nlp.org.
Related papers
- Literary and Colloquial Dialect Identification for Tamil using Acoustic Features [0.0]
Speech technology plays a role in preserving various dialects of a language from going extinct.
The current work proposes a way to identify two popular and broadly classified Tamil dialects.
arXiv Detail & Related papers (2024-08-27T09:00:27Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z) - Towards dialect-inclusive recognition in a low-resource language: are
balanced corpora the answer? [5.1121440213561335]
This study is a diagnostic to quantify the effect of the speaker's dialect on recognition performance.
12 ASR systems were trained using dialect-balanced training corpora and modified versions of the baseline corpora.
Results indicate that dialect-balanced corpora do not yield a similar performance across the dialects.
There is a close relationship between Co and Mu dialects, but one that is not symmetrical.
arXiv Detail & Related papers (2023-07-14T12:18:38Z) - DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue SAE-trained models with multi-dialectal robustness.
We show that DADA is effective for both single task and instruction fine language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z) - VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE)
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z) - Learning to Recognize Dialect Features [21.277962038423123]
We introduce the task of dialect feature detection, and present two multitask learning approaches.
We train our models on a small number of minimal pairs, building on how linguists typically define dialect features.
arXiv Detail & Related papers (2020-10-23T23:25:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.