Mitigating Language-Dependent Ethnic Bias in BERT
- URL: http://arxiv.org/abs/2109.05704v2
- Date: Tue, 14 Sep 2021 06:27:36 GMT
- Title: Mitigating Language-Dependent Ethnic Bias in BERT
- Authors: Jaimeen Ahn and Alice Oh
- Abstract summary: We study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT.
To observe and quantify ethnic bias, we develop a novel metric called the Categorical Bias score.
We propose two methods for mitigation: first, using a multilingual model, and second, using contextual word alignment of two monolingual models.
- Score: 11.977810781738603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BERT and other large-scale language models (LMs) contain gender and racial
bias. They also exhibit other dimensions of social bias, most of which have not
been studied in depth, and some of which vary depending on the language. In
this paper, we study ethnic bias and how it varies across languages by
analyzing and mitigating ethnic bias in monolingual BERT for English, German,
Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we
develop a novel metric called the Categorical Bias score. Then we propose two
methods for mitigation: first, using a multilingual model, and second, using
contextual word alignment of two monolingual models. We compare our proposed
methods with monolingual BERT and show that these methods effectively alleviate
the ethnic bias. Which of the two methods works better depends on the amount of
NLP resources available for that language. We additionally experiment with
Arabic and Greek to verify that our proposed methods work for a wider variety
of languages.
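
The abstract describes the Categorical Bias score as a variance-based measure over a category of targets. Below is a minimal, hypothetical sketch of such a score for a masked LM using HuggingFace transformers: the templates, attribute words, and target words are illustrative placeholders, not the paper's evaluation set, and the exact normalization may differ from the authors' definition.

```python
# Hypothetical sketch of a Categorical-Bias-style score: variance of log
# normalized target probabilities across a category (here, place terms),
# averaged over templates and attributes.
import math
from statistics import pvariance

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any monolingual BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def mask_prob(sentence: str, word: str, slot: int = 0) -> float:
    """P(word) at the slot-th [MASK] position (single-token words only)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos[slot]].softmax(dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

def categorical_bias(templates, attributes, targets) -> float:
    scores = []
    for template in templates:
        for attr in attributes:
            # Log probability of each target, normalized by its prior
            # (the same template with the attribute slot also masked).
            logs = []
            for target in targets:
                p_target = mask_prob(template.format(attr=attr), target)
                p_prior = mask_prob(template.format(attr=tokenizer.mask_token), target)
                logs.append(math.log(p_target / p_prior))
            scores.append(pvariance(logs))  # variance across the category
    return sum(scores) / len(scores)

# Illustrative inputs; the target slot is the first [MASK] in each template.
templates = ["People from " + tokenizer.mask_token + " are {attr}."]
attributes = ["violent", "friendly"]
targets = ["america", "germany", "turkey"]  # single-token terms in this vocab
print(categorical_bias(templates, attributes, targets))
```

Intuitively, if the model assigns very different normalized probabilities to the targets within the same stereotypical template, the variance, and hence the score, is high; an unbiased model would score near zero.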
Related papers
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z) - Comparing Biases and the Impact of Multilingual Training across Multiple
Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z) - An Analysis of Social Biases Present in BERT Variants Across Multiple
Languages [0.0]
We investigate the bias present in monolingual BERT models across a diverse set of languages.
We propose a template-based method to measure any kind of bias, based on sentence pseudo-likelihood.
We conclude that current methods of probing for bias are highly language-dependent.
arXiv Detail & Related papers (2022-11-25T23:38:08Z) - Multilingual BERT has an accent: Evaluating English influences on
- Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Debiasing Multilingual Word Embeddings: A Case Study of Three Indian
Languages [9.208381487410191]
We consider different methods to quantify bias and different debiasing approaches for monolingual as well as multilingual settings.
Our proposed methods establish the state-of-the-art performance for debiasing multilingual embeddings for three Indian languages.
arXiv Detail & Related papers (2021-07-21T16:12:51Z) - How Good is Your Tokenizer? On the Monolingual Performance of
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z) - Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.