Related papers: Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties

Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties

URL: http://arxiv.org/abs/2401.11487v1
Date: Sun, 21 Jan 2024 13:18:20 GMT
Title: Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
Authors: Nhi Pham, Lachlan Pham, Adam L. Meyers
Abstract summary: We aim to address the issue of bias at its root - the data itself. We curate a dataset of tweets from countries with high proportions of underserved English variety speakers. Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst usage of these varieties was - and, in many cases, still is - used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) can be commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools as they often differ orthographically and syntactically from standard English for which the majority of these tools are built. NLP models trained on standard English texts produced biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models. We aim to address the issue of bias at its root - the data itself. We curate a dataset of tweets from countries with high proportions of underserved English variety speakers, and propose an annotation framework of six categorical classifications along a pseudo-spectrum that measures the degree of standard English and that thereby indirectly aims to surface the manifestations of English varieties in these tweets. Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries, labeled by annotators who are from those countries and can communicate in regionally-dominant varieties of English. Our corpus highlights the accuracy discrepancies in pre-trained language identifiers between western English and non-western (i.e., less standard) English varieties. We hope to contribute to the growing literature identifying and reducing the implicit demographic discrepancies in NLP.

Related papers

BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English [8.823927892310238]
BESSTIE is a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK) We collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels.
arXiv Detail & Related papers (2024-12-06T02:34:40Z)
Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet. On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Cross-lingual Transfer Learning for Check-worthy Claim Identification over Twitter [7.601937548486356]
Misinformation spread over social media has become an undeniable infodemic. We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of Multilingual BERT (mBERT) model. Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as good as monolingual models that are trained on the target language.
arXiv Detail & Related papers (2022-11-09T18:18:53Z)
Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text. This leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi. We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
Mitigating Racial Biases in Toxic Language Detection with an Equity-Based Ensemble Framework [9.84413545378636]
Recent research has demonstrated how racial biases against users who write African American English exist in popular toxic language datasets. We propose additional descriptive fairness metrics to better understand the source of these biases. We show that our proposed framework substantially reduces the racial biases that the model learns from these datasets.
arXiv Detail & Related papers (2021-09-27T15:54:05Z)
Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information. Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks. This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter. We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations. We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling [0.30458514384586405]
We use structural topic modeling to examine racial bias in social media posts. We augment the abusive language dataset by adding an additional feature indicating the predicted probability of the tweet being written in African-American English.
arXiv Detail & Related papers (2020-05-26T21:02:43Z)
It's Morphin' Time! Combating Linguistic Discrimination with Inflectional Perturbations [68.16751625956243]
Only perfect Standard English corpora predisposes neural networks to discriminate against minorities from non-standard linguistic backgrounds. We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples.
arXiv Detail & Related papers (2020-05-09T04:01:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.