Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
- URL: http://arxiv.org/abs/2312.11779v3
- Date: Sat, 6 Apr 2024 09:32:53 GMT
- Title: Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
- Authors: Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta
- Abstract summary: Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLMs).
We find that misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization.
We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency.
- Score: 75.85462924188076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLMs), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.
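The tokenization disparity the abstract describes is easy to inspect directly. Below is a minimal sketch, not taken from the paper, that shows how a BPE tokenizer splits neopronouns into multiple sub-word pieces while binary pronouns stay intact, and then approximates pronoun tokenization parity by registering neopronouns as single vocabulary entries. The GPT-2 tokenizer, the specific pronoun list, and the `add_tokens`-based parity step are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch (not the authors' code): inspect BPE fragmentation of
# pronouns with the Hugging Face GPT-2 tokenizer, then register neopronouns
# as single tokens as a rough stand-in for pronoun tokenization parity.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for pronoun in ["she", "he", "they", "xe", "zir", "fae"]:
    # Leading space so each pronoun is tokenized as a word-initial unit,
    # the way it appears mid-sentence.
    pieces = tokenizer.tokenize(" " + pronoun)
    print(f"{pronoun:>4} -> {pieces} ({len(pieces)} token(s))")
# Binary pronouns usually map to a single token, while neopronouns such as
# "zir" or "fae" are split into several sub-word pieces.

# Approximate parity: add each neopronoun as its own vocabulary entry and
# resize the embedding matrix so every pronoun occupies one token before
# finetuning. The actual method in the paper may differ from this step.
model = AutoModelForCausalLM.from_pretrained("gpt2")
num_added = tokenizer.add_tokens(["xe", "zir", "fae"])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} pronoun tokens; vocab size is now {len(tokenizer)}")
```

With parity in place, every pronoun enters the model as exactly one embedding, which is the property the paper's first technique enforces before finetuning.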
Related papers
- Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders.
This study presents the benchmark AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words).
We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z) - From 'Showgirls' to 'Performers': Fine-tuning with Gender-inclusive Language for Bias Reduction in LLMs [1.1049608786515839]
We adapt linguistic structures within Large Language Models to promote gender-inclusivity.
The focus of our work is gender-exclusive affixes in English, such as in 'show-girl' or 'man-cave'.
arXiv Detail & Related papers (2024-07-05T11:31:30Z) - Why Not Transform Chat Large Language Models to Non-English? [57.16587777261422]
The scarcity of non-English data limits the development of non-English large language models (LLMs).
TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought.
Our method, using only single-turn data, outperforms strong baselines and ChatGPT on multi-turn benchmark MT-bench.
arXiv Detail & Related papers (2024-05-22T18:53:25Z) - Transforming Dutch: Debiasing Dutch Coreference Resolution Systems for Non-binary Pronouns [5.5514102920271196]
Gender-neutral pronouns are increasingly being introduced across Western languages.
Recent evaluations have demonstrated that English NLP systems are unable to correctly process gender-neutral pronouns.
This paper examines a Dutch coreference resolution system's performance on gender-neutral pronouns.
arXiv Detail & Related papers (2024-04-30T18:31:19Z) - Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [87.30837365008931]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks.
This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
arXiv Detail & Related papers (2024-01-28T06:50:10Z) - MISGENDERED: Limits of Large Language Models in Understanding Pronouns [46.276320374441056]
We evaluate popular language models for their ability to correctly use English gender-neutral pronouns.
We introduce MISGENDERED, a framework for evaluating large language models' ability to correctly use preferred pronouns.
arXiv Detail & Related papers (2023-06-06T18:27:52Z) - "I'm fully who I am": Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation [69.25368160338043]
Transgender and non-binary (TGNB) individuals disproportionately experience discrimination and exclusion from daily life.
We assess how the social reality surrounding experienced marginalization of TGNB persons contributes to and persists within Open Language Generation.
We introduce TANGO, a dataset of template-based real-world text curated from a TGNB-oriented community.
arXiv Detail & Related papers (2023-05-17T04:21:45Z) - Welcome to the Modern World of Pronouns: Identity-Inclusive Natural Language Processing beyond Gender [23.92148222207458]
We provide an overview of 3rd person pronoun issues for Natural Language Processing.
We evaluate existing and novel modeling approaches.
We quantify the impact of a more discrimination-free approach on established benchmark data.
arXiv Detail & Related papers (2022-02-24T06:42:11Z) - First the worst: Finding better gender translations during beam search [19.921216907778447]
We focus on gender bias resulting from systematic errors in grammatical gender translation.
We experiment with reranking n-best lists using gender features obtained automatically from the source sentence.
We find that a combination of these techniques allows large gains in WinoMT accuracy without requiring additional bilingual data or an additional NMT model.
arXiv Detail & Related papers (2021-04-15T12:53:30Z)