Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
- URL: http://arxiv.org/abs/2401.11487v1
- Date: Sun, 21 Jan 2024 13:18:20 GMT
- Title: Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties
- Authors: Nhi Pham, Lachlan Pham, Adam L. Meyers
- Abstract summary: We aim to address the issue of bias at its root - the data itself.
We curate a dataset of tweets from countries with high proportions of underserved English variety speakers.
Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The prevalence of social media presents a growing opportunity to collect and
analyse examples of English varieties. Whilst these varieties were -
and, in many cases, still are - used only in spoken contexts or hard-to-access
private messages, social media sites like Twitter provide a platform for users
to communicate informally in a scrapeable format. Notably, Indian English
(Hinglish), Singaporean English (Singlish), and African-American English (AAE)
can be commonly found online. These varieties pose a challenge to existing
natural language processing (NLP) tools as they often differ orthographically
and syntactically from standard English for which the majority of these tools
are built. NLP models trained on standard English texts produce biased
outcomes for users of underrepresented varieties. Some research has aimed to
overcome the inherent biases caused by unrepresentative data through techniques
like data augmentation or adjusting training models.
We aim to address the issue of bias at its root - the data itself. We curate
a dataset of tweets from countries with high proportions of underserved English
variety speakers, and propose an annotation framework of six categorical
classifications along a pseudo-spectrum that measures the degree of standard
English and that thereby indirectly aims to surface the manifestations of
English varieties in these tweets. Following best annotation practices, our
growing corpus features 170,800 tweets taken from 7 countries, labeled by
annotators who are from those countries and can communicate in
regionally-dominant varieties of English. Our corpus highlights the accuracy
discrepancies in pre-trained language identifiers between western English and
non-western (i.e., less standard) English varieties. We hope to contribute to
the growing literature identifying and reducing the implicit demographic
discrepancies in NLP.
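The accuracy discrepancy mentioned in the abstract can be illustrated with a minimal sketch that groups language-identifier predictions by country and compares how often tweets are recognised as English. This is not the authors' evaluation code: the function names, the toy rows, the country labels, and the stub identifier `toy_identify` are all hypothetical and stand in for a real pre-trained language identifier and the real corpus.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def accuracy_by_country(
    rows: Iterable[Tuple[str, str]],
    identify: Callable[[str], str],
    expected: str = "en",
) -> Dict[str, float]:
    """Return, per country, the share of tweets that `identify` labels as `expected`."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for country, text in rows:
        totals[country] += 1
        if identify(text) == expected:
            hits[country] += 1
    return {country: hits[country] / totals[country] for country in totals}


def accuracy_gap(per_country: Dict[str, float]) -> float:
    """Difference between the best-served and worst-served country."""
    return max(per_country.values()) - min(per_country.values())


def toy_identify(text: str) -> str:
    """Hypothetical stand-in for a language identifier.

    It rejects any tweet containing the particle "lah", mimicking a tool
    that fails on a non-standard variety. A real identifier would go here.
    """
    return "und" if "lah" in text.lower().split() else "en"


# Made-up rows of (country label, tweet text); not taken from the corpus.
TOY_ROWS = [
    ("country_a", "see you tomorrow"),
    ("country_a", "what a game"),
    ("country_b", "cannot lah"),
    ("country_b", "see you tomorrow"),
]

if __name__ == "__main__":
    per_country = accuracy_by_country(TOY_ROWS, toy_identify)
    print(per_country)
    print("gap:", accuracy_gap(per_country))
```

Reporting accuracy per country rather than as a single pooled figure is what makes such a gap visible; a pooled score would hide which group of speakers the identifier serves poorly.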
Related papers
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Cross-lingual Transfer Learning for Check-worthy Claim Identification over Twitter [7.601937548486356]
Misinformation spread over social media has become an undeniable infodemic.
We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the Multilingual BERT (mBERT) model.
Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models that are trained on the target language.
arXiv Detail & Related papers (2022-11-09T18:18:53Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- Mitigating Racial Biases in Toxic Language Detection with an Equity-Based Ensemble Framework [9.84413545378636]
Recent research has demonstrated how racial biases against users who write African American English exist in popular toxic language datasets.
We propose additional descriptive fairness metrics to better understand the source of these biases.
We show that our proposed framework substantially reduces the racial biases that the model learns from these datasets.
arXiv Detail & Related papers (2021-09-27T15:54:05Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Examining Racial Bias in an Online Abuse Corpus with Structural Topic Modeling [0.30458514384586405]
We use structural topic modeling to examine racial bias in social media posts.
We augment the abusive language dataset by adding an additional feature indicating the predicted probability of the tweet being written in African-American English.
arXiv Detail & Related papers (2020-05-26T21:02:43Z)
- It's Morphin' Time! Combating Linguistic Discrimination with Inflectional Perturbations [68.16751625956243]
Training on only perfect Standard English corpora predisposes neural networks to discriminate against minorities from non-standard linguistic backgrounds.
We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples.
arXiv Detail & Related papers (2020-05-09T04:01:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.