"I'm" Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets
- URL: http://arxiv.org/abs/2304.13557v1
- Date: Sat, 22 Apr 2023 09:27:32 GMT
- Title: "I'm" Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets
- Authors: Katie Seaborn, Yeongdae Kim
- Abstract summary: Crowdsourcing initiatives have focused on multilingual translation of big, open data sets for use in natural language processing (NLP).
We focus on the case of pronouns translated between English and Japanese in the crowdsourced Tatoeba database.
We found that masculine pronoun biases were present overall, even though plurality in language was accounted for in other ways.
- Score: 13.32560004325655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As virtual assistants continue to be taken up globally, there is an
ever-greater need for these speech-based systems to communicate naturally in a
variety of languages. Crowdsourcing initiatives have focused on multilingual
translation of big, open data sets for use in natural language processing
(NLP). Yet, language translation is often not one-to-one, and biases can
trickle in. In this late-breaking work, we focus on the case of pronouns
translated between English and Japanese in the crowdsourced Tatoeba database.
We found that masculine pronoun biases were present overall, even though
plurality in language was accounted for in other ways. Importantly, we detected
biases in the translation process that reflect nuanced reactions to the
presence of feminine, neutral, and/or non-binary pronouns. We raise the issue
of translation bias for pronouns and offer a practical solution to embed
plurality in NLP data sets.
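The paper examines how English first-person pronouns are rendered in Japanese, where several first-person pronouns carry gendered connotations (e.g., 僕 and 俺 read as masculine, あたし as feminine). As a rough illustration of this kind of audit, and not the authors' actual pipeline, the sketch below tallies which Japanese first-person pronoun appears opposite an English "I" in a set of translation pairs; the pronoun lexicon and sample pairs are illustrative assumptions.

```python
# Illustrative audit: tally which Japanese first-person pronoun renders an
# English "I" in translation pairs. The lexicon and toy pairs are
# assumptions for demonstration, not the paper's data; the substring
# check on the Japanese side is deliberately naive.
import re
from collections import Counter

# Common Japanese first-person pronouns and their conventional gender coding.
FIRST_PERSON = {
    "私": "neutral/formal",   # watashi
    "僕": "masculine",        # boku
    "俺": "masculine",        # ore
    "あたし": "feminine",     # atashi
}

def tally_pronouns(pairs):
    """Count the gender coding of the Japanese pronoun paired with English 'I'."""
    counts = Counter()
    for english, japanese in pairs:
        if not re.search(r"\bI\b", english):
            continue  # only sentences with an explicit English "I"
        for pronoun, coding in FIRST_PERSON.items():
            if pronoun in japanese:
                counts[coding] += 1
    return counts

# Toy stand-ins for Tatoeba English-Japanese sentence pairs.
pairs = [
    ("I am hungry.", "僕はお腹が空いた。"),
    ("I like cats.", "私は猫が好きです。"),
    ("I forgot my keys.", "俺は鍵を忘れた。"),
]
print(tally_pronouns(pairs))  # e.g. Counter({'masculine': 2, 'neutral/formal': 1})
```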
Related papers
- Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders.
This study presents AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words), a benchmark for evaluating gender-inclusive translation.
We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z)
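The entry above does not spell out how the Emotional Attitude Score is computed. Purely as a guess at the general shape of such a metric, the sketch below averages lexicon polarities over the ambiguous attitude words found in a translation and compares gendered variants; the lexicon, values, and aggregation rule are all hypothetical.

```python
# Hypothetical EAS-style metric: average the polarity of "ambiguous
# attitude words" appearing in a translation, then compare the score
# across gendered variants of the same sentence. The lexicon and
# aggregation are assumptions; the benchmark's actual procedure may differ.
ATTITUDE_LEXICON = {"assertive": 0.3, "bossy": -0.6, "confident": 0.7, "emotional": -0.2}

def attitude_score(tokens):
    """Mean polarity of the attitude words present, 0.0 if none appear."""
    hits = [ATTITUDE_LEXICON[t] for t in tokens if t in ATTITUDE_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def bias_gap(translation_a, translation_b):
    """Difference in attitude score between two gendered variants."""
    return attitude_score(translation_a.lower().split()) - \
           attitude_score(translation_b.lower().split())

print(bias_gap("she is bossy and emotional", "he is assertive and confident"))  # -0.9
```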
- Investigating Markers and Drivers of Gender Bias in Machine Translations [0.0]
Implicit gender bias in large language models (LLMs) is a well-documented problem.
We use the DeepL translation API to investigate the bias evinced when repeatedly translating a set of 56 Software Engineering tasks.
We find that some languages display similar patterns of pronoun use, falling into three loose groups.
We identify the main verb appearing in a sentence as a likely significant driver of implied gender in the translations.
arXiv Detail & Related papers (2024-03-18T15:54:46Z)
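A rough sketch of the probing setup described in the entry above, using the official deepl Python client: translate a gender-ambiguous sentence into a gendered target language repeatedly and tally the pronouns that come back. The prompt, pronoun list, and repetition count are assumptions; the study's actual 56 Software Engineering tasks are not reproduced here.

```python
# Sketch of a repeated-translation probe with the official `deepl`
# package (requires an API key). Translate an ambiguous English prompt
# into German and count the third-person pronouns in the output.
import re
from collections import Counter

import deepl

translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key
prompt = "The developer fixed the bug before they went home."
counts = Counter()

for _ in range(20):  # repetition count is arbitrary here
    result = translator.translate_text(prompt, target_lang="DE")
    # Tally German third-person pronouns in the translated text.
    for pronoun in re.findall(r"\b(er|sie|es)\b", result.text.lower()):
        counts[pronoun] += 1

print(counts)  # e.g. Counter({'er': 18, 'sie': 2})
```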
- On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss [120.19360680963152]
Unsupervised neural machine translation (UNMT) has achieved success in many language pairs.
The copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs.
We propose a simple but effective training schedule that incorporates a language discriminator loss.
arXiv Detail & Related papers (2023-05-26T18:14:23Z)
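The training schedule itself is more involved than fits here; as a smaller, related illustration, the sketch below measures the copying problem the entry above defines, i.e., the fraction of output tokens copied verbatim from the source. The metric is an assumption for illustration, not the paper's diagnostic.

```python
# Illustrative detector for the copying problem: the fraction of output
# tokens that also appear in the source sentence. A high rate on a
# distant language pair suggests the model copied the input rather than
# translating it.
def copy_rate(source: str, output: str) -> float:
    src_tokens = set(source.lower().split())
    out_tokens = output.lower().split()
    if not out_tokens:
        return 0.0
    return sum(t in src_tokens for t in out_tokens) / len(out_tokens)

print(copy_rate("the cat sat down", "the cat sat down"))     # 1.0: pure copy
print(copy_rate("the cat sat down", "le chat s'est assis"))  # 0.0: translated
```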
- Gender Lost In Translation: How Bridging The Gap Between Languages Affects Gender Bias in Zero-Shot Multilingual Translation [12.376309678270275]
We examine how bridging the gap between languages for which parallel data is not available affects gender bias in multilingual NMT.
We study the effect of encouraging language-agnostic hidden representations on models' ability to preserve gender.
We find that language-agnostic representations mitigate the masculine bias of zero-shot models, and that as gender inflection in the bridge language increases, pivot translation surpasses zero-shot translation in preserving speaker-related gender agreement.
arXiv Detail & Related papers (2023-05-26T13:51:50Z)
- What about em? How Commercial Machine Translation Fails to Handle (Neo-)Pronouns [26.28827649737955]
Wrong pronoun translations can discriminate against marginalized groups, e.g., non-binary individuals.
We study how three commercial machine translation systems translate 3rd-person pronouns.
Our error analysis shows that the presence of a gender-neutral pronoun often leads to grammatical and semantic translation errors.
arXiv Detail & Related papers (2023-05-25T13:34:09Z)
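A minimal sketch of the kind of error analysis the entry above reports: flag translations where an English gender-neutral or neo-pronoun ends up rendered with a gendered German pronoun. The pronoun sets are assumptions, and a string check is far cruder than the paper's analysis.

```python
# Crude audit: flag English->German pairs where a neutral or neo-pronoun
# on the source side co-occurs with a gendered pronoun in the output.
# Pronoun sets and the example pair are assumptions for illustration.
import re

NEUTRAL_EN = {"they", "them", "xe", "xem", "ze", "hir", "em"}
GENDERED_DE = {"er", "ihn", "ihm", "sie", "ihr"}

def misgendered(english: str, german: str) -> bool:
    en_tokens = set(re.findall(r"[a-z']+", english.lower()))
    de_tokens = set(re.findall(r"[a-zäöüß]+", german.lower()))
    return bool(en_tokens & NEUTRAL_EN) and bool(de_tokens & GENDERED_DE)

print(misgendered("Xe lost the keys.", "Er hat die Schlüssel verloren."))  # True
```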
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
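The blurb above does not spell out the selection criterion; the sketch below shows one generic uncertainty-based variant, routing the machine translations a model is least confident about to human translators. The names, scores, and budget rule are assumptions, and the paper's utterance-selection strategy may differ substantially.

```python
# Hypothetical active-learning split: send low-confidence machine
# translations to humans, keep high-confidence ones as-is. Confidence
# scores and the budget are placeholders for illustration.
def select_for_humans(candidates, budget):
    """candidates: list of (utterance, machine_translation, confidence)."""
    ranked = sorted(candidates, key=lambda c: c[2])  # least confident first
    return ranked[:budget], ranked[budget:]

candidates = [
    ("book a table", "reserve una mesa", 0.91),
    ("wake me at eight", "despiértame a las ocho", 0.42),  # low confidence
    ("play some jazz", "pon algo de jazz", 0.88),
]
to_humans, keep_machine = select_for_humans(candidates, budget=1)
print([u for u, _, _ in to_humans])  # ['wake me at eight']
```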
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Repairing Pronouns in Translation with BERT-Based Post-Editing [7.6344611819427035]
We show that in some domains, pronoun choice can account for more than half of an NMT system's errors.
We then investigate a possible solution: fine-tuning BERT on a pronoun prediction task using chunks of source-side sentences.
arXiv Detail & Related papers (2021-03-23T21:01:03Z)
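To give a flavor of the pronoun-prediction task in the entry above, the sketch below uses a pretrained BERT (off-the-shelf, not fine-tuned as the paper proposes) via Hugging Face's fill-mask pipeline to rank candidate pronouns for a masked slot; the example sentence is an assumption.

```python
# Sketch of pronoun prediction with BERT via the Hugging Face fill-mask
# pipeline. The paper fine-tunes BERT on source-side chunks; this uses an
# off-the-shelf model purely to illustrate the prediction step.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")
sentence = "Maria finished the report before [MASK] went home."

# Restrict scoring to candidate pronouns and print their probabilities.
for candidate in fill(sentence, targets=["she", "he", "they"]):
    print(f"{candidate['token_str']:>5}  {candidate['score']:.3f}")
```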
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
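The entry above does not say which bias measures the authors propose; one common way to quantify bias in embeddings is a WEAT-style association score, sketched minimally below with toy vectors standing in for real multilingual embeddings.

```python
# WEAT-style association sketch: compare how strongly a target word's
# embedding associates with two attribute sets. Toy random vectors stand
# in for real multilingual embeddings; the paper's measures may differ.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(word_vec, attrs_a, attrs_b):
    """Positive: closer to set A on average; negative: closer to set B."""
    return (np.mean([cosine(word_vec, v) for v in attrs_a])
            - np.mean([cosine(word_vec, v) for v in attrs_b]))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["doctor", "he", "him", "she", "her"]}
score = association(emb["doctor"], [emb["he"], emb["him"]], [emb["she"], emb["her"]])
print(f"male-vs-female association of 'doctor': {score:+.3f}")
```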
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
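As one simple way to operationalize the language-identification result mentioned in the entry above (not the paper's method), the sketch below assumes mean-pooled sentence embeddings and a nearest-centroid rule, with toy vectors in place of real multilingual contextual embeddings.

```python
# Nearest-centroid language identification over sentence embeddings:
# average the training vectors per language, then assign a new sentence
# to the closest centroid. Random toy vectors stand in for real
# multilingual contextual embeddings.
import numpy as np

def centroids(train):
    """train: dict mapping language -> array of sentence vectors."""
    return {lang: vecs.mean(axis=0) for lang, vecs in train.items()}

def identify(vec, cents):
    return min(cents, key=lambda lang: np.linalg.norm(vec - cents[lang]))

rng = np.random.default_rng(1)
train = {
    "en": rng.normal(loc=0.0, size=(50, 16)),
    "ja": rng.normal(loc=1.0, size=(50, 16)),
}
cents = centroids(train)
test_vec = rng.normal(loc=1.0, size=16)  # drawn near the "ja" cluster
print(identify(test_vec, cents))  # likely "ja"
```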