It's Morphin' Time! Combating Linguistic Discrimination with
Inflectional Perturbations
- URL: http://arxiv.org/abs/2005.04364v1
- Date: Sat, 9 May 2020 04:01:43 GMT
- Title: It's Morphin' Time! Combating Linguistic Discrimination with
Inflectional Perturbations
- Authors: Samson Tan, Shafiq Joty, Min-Yen Kan, Richard Socher
- Abstract summary: Training only on perfect Standard English corpora predisposes neural networks to discriminate against minorities from non-standard linguistic backgrounds.
We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples.
- Score: 68.16751625956243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training on only perfect Standard English corpora predisposes pre-trained
neural networks to discriminate against minorities from non-standard linguistic
backgrounds (e.g., African American Vernacular English, Colloquial Singapore
English, etc.). We perturb the inflectional morphology of words to craft
plausible and semantically similar adversarial examples that expose these
biases in popular NLP models, e.g., BERT and Transformer, and show that
adversarially fine-tuning them for a single epoch significantly improves
robustness without sacrificing performance on clean data.
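The perturbation idea can be sketched in a few lines. Note the hedging: the paper's actual algorithm (Morpheus) searches over candidate inflections adversarially against a target model; the hand-written inflection table below is a hypothetical stand-in for a real morphological generator.

```python
# Toy sketch of inflectional perturbation: generate all single-token
# variants of a sentence by swapping a word's inflected form.
# INFLECTIONS is a hypothetical mini-table; a real system would use a
# morphological generator covering the full vocabulary.

INFLECTIONS = {
    "walk":  ["walks", "walked", "walking"],
    "walks": ["walk", "walked", "walking"],
    "eat":   ["eats", "ate", "eating"],
    "eats":  ["eat", "ate", "eating"],
}

def perturb_inflections(tokens):
    """Return every sentence obtained by re-inflecting one token."""
    variants = []
    for i, tok in enumerate(tokens):
        for alt in INFLECTIONS.get(tok.lower(), []):
            variants.append(tokens[:i] + [alt] + tokens[i + 1:])
    return variants

for v in perturb_inflections("she walks to work".split()):
    print(" ".join(v))
```

An adversarial search would then keep whichever variant most degrades the target model's prediction, which is what exposes the bias the abstract describes.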
Related papers
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Detecting Bias in Large Language Models: Fine-tuned KcBERT [0.0]
We define such harm as societal bias and assess ethnic, gender, and racial biases in a model fine-tuned with Korean comments.
Our contribution lies in demonstrating that societal bias exists in Korean language models due to language-dependent characteristics.
arXiv Detail & Related papers (2024-03-16T02:27:19Z)
- Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties [0.0]
We aim to address the issue of bias at its root - the data itself.
We curate a dataset of tweets from countries with high proportions of underserved English variety speakers.
Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries.
arXiv Detail & Related papers (2024-01-21T13:18:20Z)
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative objective yields strong performance improvements, outperforming current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
- Disambiguation of morpho-syntactic features of African American English -- the case of habitual be [1.4699455652461728]
Habitual "be" is isomorphic, and therefore ambiguous, with other forms of "be" found in both AAE and other varieties of English.
We employ a combination of rule-based filters and data augmentation that generate a corpus balanced between habitual and non-habitual instances.
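One of the rule-based filters described above can be sketched with a single pattern. This is a hedged illustration, not the paper's filter set: habitual "be" in AAE frequently precedes a present participle ("He be working"), so a simple heuristic flags that configuration; real disambiguation needs multiple rules plus the data augmentation the abstract mentions.

```python
import re

# Hypothetical rule-based filter: flag "be" directly followed by a word
# ending in "-ing", a pattern often associated with habitual "be" in AAE.
HABITUAL_BE = re.compile(r"\bbe\s+\w+ing\b", re.IGNORECASE)

def maybe_habitual(sentence: str) -> bool:
    """True if the sentence matches the habitual-'be' heuristic."""
    return bool(HABITUAL_BE.search(sentence))

print(maybe_habitual("He be working every day"))   # matches the heuristic
print(maybe_habitual("He will be here tomorrow"))  # no participle follows
```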
arXiv Detail & Related papers (2022-04-26T16:30:22Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms the state of the art on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
- Adversarial Training with Contrastive Learning in NLP [0.0]
We propose adversarial training with contrastive learning (ATCL) to adversarially train models on language processing tasks.
The core idea is to make linear perturbations in the embedding space of the input via fast gradient methods (FGM) and train the model to keep the original and perturbed representations close via contrastive learning.
The results show not only an improvement in the quantitative (perplexity and BLEU) scores when compared to the baselines, but ATCL also achieves good qualitative results in the semantic level for both tasks.
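The FGM step that ATCL builds on can be sketched in a few lines. Hedged: the gradient below is a hand-supplied stand-in for backpropagation through the task loss, and the contrastive term that pulls clean and perturbed representations together is omitted.

```python
import math

# Fast-gradient-method (FGM) perturbation step: move the input embedding
# a distance epsilon along the normalized loss gradient.
def fgm_perturb(embedding, grad, epsilon=0.1):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)  # no gradient signal; leave input unchanged
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

emb  = [0.5, -0.2, 0.1]
grad = [3.0, 0.0, 4.0]  # toy gradient with norm 5
print(fgm_perturb(emb, grad))
```

Normalizing by the gradient norm keeps the perturbation on a sphere of radius epsilon, so the attack strength is fixed regardless of gradient scale.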
arXiv Detail & Related papers (2021-09-19T07:23:45Z)
- Discriminatively-Tuned Generative Classifiers for Robust Natural Language Inference [59.62779187457773]
We propose GenNLI, a generative classifier for natural language inference (NLI).
We compare it to five baselines, including discriminative models and large-scale pretrained language representation models like BERT.
Experiments show that GenNLI outperforms both discriminative and pretrained baselines across several challenging NLI experimental settings.
arXiv Detail & Related papers (2020-10-08T04:44:00Z)
- Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding [44.356771106881006]
Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English.
We propose Base-Inflection Encoding (BITE) to tokenize English text by reducing inflected words to their base forms.
We show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers.
arXiv Detail & Related papers (2020-04-30T15:15:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.