Debiasing Multilingual Word Embeddings: A Case Study of Three Indian
Languages
- URL: http://arxiv.org/abs/2107.10181v2
- Date: Thu, 22 Jul 2021 16:57:31 GMT
- Title: Debiasing Multilingual Word Embeddings: A Case Study of Three Indian
Languages
- Authors: Srijan Bansal, Vishal Garimella, Ayush Suhane, Animesh Mukherjee
- Abstract summary: We consider different methods to quantify bias and different debiasing approaches for monolingual as well as multilingual settings.
Our proposed methods establish state-of-the-art performance for debiasing multilingual embeddings for three Indian languages.
- Score: 9.208381487410191
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we advance the current state-of-the-art method for debiasing
monolingual word embeddings so as to generalize well in a multilingual setting.
We consider different methods to quantify bias and different debiasing
approaches for monolingual as well as multilingual settings. We demonstrate the
significance of our bias-mitigation approach on downstream NLP applications.
Our proposed methods establish state-of-the-art performance for debiasing
multilingual embeddings for three Indian languages (Hindi, Bengali, and Telugu)
in addition to English. We believe that our work will open up new opportunities
in building unbiased downstream NLP applications that are inherently dependent
on the quality of the word embeddings used.
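The abstract does not spell out the debiasing algorithm itself. A common baseline in this line of work is projection-based "hard" debiasing in the style of Bolukbasi et al.: estimate a bias direction from definitional word pairs and remove each word vector's component along it. The sketch below uses toy 4-dimensional random vectors in place of real embeddings; the word list and dimensionality are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def debias(vectors, direction):
    """Remove each vector's component along a bias direction
    (projection-based hard debiasing, in the spirit of Bolukbasi et al.)."""
    d = direction / np.linalg.norm(direction)
    return vectors - np.outer(vectors @ d, d)

# Toy 4-d embeddings; real ones would be, e.g., 300-d fastText vectors.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["doctor", "nurse", "he", "she"]}

# Bias direction estimated from a single definitional pair (real systems
# average over several pairs, e.g. he-she, man-woman, him-her).
direction = emb["he"] - emb["she"]

words = ["doctor", "nurse"]
cleaned = debias(np.stack([emb[w] for w in words]), direction)

# After debiasing, profession words carry no component along the he-she axis.
d = direction / np.linalg.norm(direction)
print(np.allclose(cleaned @ d, 0.0))  # True
```

Extending this to the multilingual case is exactly where the paper's contribution lies: the bias direction must be estimated consistently across languages that share an embedding space.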
Related papers
- Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Investigating Bias in Multilingual Language Models: Cross-Lingual Transfer of Debiasing Techniques [3.9673530817103333]
Cross-lingual transfer of debiasing techniques is not only feasible but also yields promising results.
Using translations of the CrowS-Pairs dataset, our analysis identifies SentenceDebias as the best technique across different languages.
arXiv Detail & Related papers (2023-10-16T11:43:30Z)
- Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z)
- Evaluating the Diversity, Equity and Inclusion of NLP Technology: A Case Study for Indian Languages [35.86100962711644]
In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world's languages.
We propose an evaluation paradigm that assesses NLP technologies across all three dimensions.
arXiv Detail & Related papers (2022-05-25T11:38:04Z)
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text, respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
- Mitigating Language-Dependent Ethnic Bias in BERT [11.977810781738603]
We study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT.
To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score.
We propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models.
arXiv Detail & Related papers (2021-09-13T04:52:41Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
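The entry above proposes ways to quantify bias in multilingual representations without stating them here. A standard quantification from the literature (not necessarily the one this paper uses) is a WEAT-style association score: the difference between a word's mean cosine similarity to two attribute sets. The attribute vectors below are random toy stand-ins for illustration only.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word_vec, attr_a, attr_b):
    """WEAT-style association: mean cosine similarity to attribute set A
    minus mean cosine similarity to attribute set B. A positive value
    means the word leans toward A; a neutral word scores near zero."""
    return (np.mean([cosine(word_vec, a) for a in attr_a])
            - np.mean([cosine(word_vec, b) for b in attr_b]))

# Random toy vectors standing in for attribute terms (hypothetical data).
rng = np.random.default_rng(1)
male = [rng.normal(size=8) for _ in range(3)]    # e.g. "he", "man", "him"
female = [rng.normal(size=8) for _ in range(3)]  # e.g. "she", "woman", "her"

# Score a candidate word vector against the two attribute sets.
score = association(rng.normal(size=8), male, female)
print(round(score, 3))
```

In a multilingual setting, the attribute word lists must be translated per language, which is one reason cross-lingual bias measurement is harder than the monolingual case.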
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
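The last entry reports that contextual embeddings can match statistical word aligners on parallel sentences. A minimal similarity-based aligner links each source token to its most similar target token by cosine similarity; real systems add symmetrization and thresholds. The vectors below are random stand-ins for contextual embeddings (e.g. mBERT outputs), purely for illustration.

```python
import numpy as np

def align(src, tgt):
    """Align each source token to the target token with the highest
    cosine similarity (a simple argmax aligner over embedding matrices)."""
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = s @ t.T                      # pairwise cosine similarities
    return [(i, int(j)) for i, j in enumerate(sim.argmax(axis=1))]

# Stand-ins for contextual embeddings of a parallel sentence pair:
# target vectors are lightly noised copies of permuted source vectors,
# so the aligner should recover the permutation.
rng = np.random.default_rng(2)
src = rng.normal(size=(3, 16))
perm = [2, 0, 1]
tgt = src[perm] + 0.01 * rng.normal(size=(3, 16))

print(align(src, tgt))  # each source index maps to its permuted copy
```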
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.