Learning to Recognize Dialect Features
- URL: http://arxiv.org/abs/2010.12707v3
- Date: Thu, 6 May 2021 22:27:50 GMT
- Title: Learning to Recognize Dialect Features
- Authors: Dorottya Demszky, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran, Jacob Eisenstein
- Abstract summary: We introduce the task of dialect feature detection, and present two multitask learning approaches.
We train our models on a small number of minimal pairs, building on how linguists typically define dialect features.
- Score: 21.277962038423123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building NLP systems that serve everyone requires accounting for dialect
differences. But dialects are not monolithic entities: rather, distinctions
between and within dialects are captured by the presence, absence, and
frequency of dozens of dialect features in speech and text, such as the
deletion of the copula in "He {} running". In this paper, we introduce the task
of dialect feature detection, and present two multitask learning approaches,
both based on pretrained transformers. For most dialects, large-scale annotated
corpora for these features are unavailable, making it difficult to train
recognizers. We train our models on a small number of minimal pairs, building
on how linguists typically define dialect features. Evaluation on a test set of
22 dialect features of Indian English demonstrates that these models learn to
recognize many features with high accuracy, and that a few minimal pairs can be
as effective for training as thousands of labeled examples. We also demonstrate
the downstream applicability of dialect feature detection both as a measure of
dialect density and as a dialect classifier.
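
The minimal-pair idea in the abstract can be illustrated with a small sketch. The Python below is not the authors' implementation: the `MinimalPair` class, the `to_training_examples` helper, and the `dialect_density` function are hypothetical names, and the density definition used here (mean number of detected features per utterance) is an assumption chosen for illustration, not the paper's definition. Only the copula-deletion example comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """One minimal pair: two sentences differing only in a dialect feature."""
    feature: str   # name of the dialect feature
    positive: str  # sentence that exhibits the feature
    negative: str  # matched sentence without the feature


def to_training_examples(pairs):
    """Flatten minimal pairs into (text, feature, label) rows for a classifier."""
    rows = []
    for pair in pairs:
        rows.append((pair.positive, pair.feature, 1))
        rows.append((pair.negative, pair.feature, 0))
    return rows


def dialect_density(detections):
    """Mean number of detected features per utterance.

    `detections` holds one set of feature names per utterance.
    This definition is an assumption made for this sketch.
    """
    if not detections:
        return 0.0
    return sum(len(found) for found in detections) / len(detections)


# Copula deletion, the example given in the abstract.
PAIRS = [MinimalPair("copula deletion", "He running", "He is running")]

if __name__ == "__main__":
    for row in to_training_examples(PAIRS):
        print(row)
    print(dialect_density([{"copula deletion"}, set()]))
```

Each pair yields one positive and one negative training row, which is how a handful of pairs can stand in for a labeled corpus when training a feature recognizer.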
Related papers
- Literary and Colloquial Dialect Identification for Tamil using Acoustic Features [0.0]
Speech technology plays a role in preserving various dialects of a language from going extinct.
The current work proposes a way to identify two popular and broadly classified Tamil dialects.
arXiv Detail & Related papers (2024-08-27T09:00:27Z)
- Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness [16.746758715820324]
We present a multitask learning approach that models dialect language as an auxiliary task to incorporate syntactic and lexical variations.
In our experiments with African-American English dialect, we provide empirical evidence that complementing common learning approaches with dialect modeling improves their fairness.
Results suggest that multitask learning achieves state-of-the-art performance and helps to detect properties of biased language more reliably.
arXiv Detail & Related papers (2024-06-14T12:39:39Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z)
- DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue SAE-trained models with multi-dialectal robustness.
We show that DADA is effective for both single-task and instruction-finetuned language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z)
- Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi-VALUE is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z)
- A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
arXiv Detail & Related papers (2022-05-06T06:07:09Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)