Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models
- URL: http://arxiv.org/abs/2508.00788v1
- Date: Fri, 01 Aug 2025 17:11:42 GMT
- Title: Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models
- Authors: Xushuo Tang, Yi Ding, Zhengyi Yang, Yin Chen, Yongrui Gu, Wenke Yang, Mingchen Ju, Xin Cao, Yongfei Liu, Wenjie Zhang
- Abstract summary: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. We introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity.
- Score: 13.89598383847666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs' handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs' pronoun fidelity. We benchmark five representative LLMs (GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5) across zero-shot, few-shot, and gender identity inference settings. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.
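To make the benchmark protocol concrete, the sketch below shows how a MISGENDERED-style zero-shot evaluation can be scored: a preamble declares an individual's pronoun set, the model fills a blank, and exact-match accuracy is computed per grammatical form. The pronoun table, template format, and `query_model` callable are illustrative assumptions, not the released MISGENDERED+ harness.

```python
# Minimal sketch of a MISGENDERED-style zero-shot evaluation loop.
# The pronoun table, template format, and `query_model` callable are
# illustrative assumptions, not the MISGENDERED+ release.
from dataclasses import dataclass

PRONOUN_FORMS = {
    "she":  {"nominative": "she",  "accusative": "her",  "possessive": "her"},
    "they": {"nominative": "they", "accusative": "them", "possessive": "their"},
    "xe":   {"nominative": "xe",   "accusative": "xem",  "possessive": "xyr"},
}

@dataclass
class Instance:
    name: str       # referent, e.g. "Aamari"
    declared: str   # declared pronoun set, e.g. "xe"
    form: str       # grammatical form the blank asks for
    template: str   # sentence containing a "___" blank

def build_prompt(inst: Instance) -> str:
    forms = PRONOUN_FORMS[inst.declared]
    preamble = (f"{inst.name}'s pronouns are "
                f"{forms['nominative']}/{forms['accusative']}/{forms['possessive']}.")
    return f"{preamble}\nFill in the blank with the correct pronoun:\n{inst.template}"

def is_correct(prediction: str, inst: Instance) -> bool:
    # Exact match against the expected surface form, case-insensitive.
    return prediction.strip().lower() == PRONOUN_FORMS[inst.declared][inst.form]

def accuracy(dataset: list[Instance], query_model) -> float:
    # `query_model(prompt) -> str` stands in for any LLM API call.
    hits = [is_correct(query_model(build_prompt(i)), i) for i in dataset]
    return sum(hits) / len(hits)
```

A few-shot variant would prepend solved examples to the prompt; the reverse (gender identity) inference setting flips the direction, asking the model to recover the declared pronoun set from its usage in context.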
Related papers
- Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation [65.92001420372007]
This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks.
We introduce the first benchmark, GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
- Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach [0.0]
Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals.
This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns.
We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity.
arXiv Detail & Related papers (2024-11-12T09:14:16Z)
- Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased? [26.583741801345507]
We present a dataset of over 5 million instances to measure pronoun fidelity in English.
Our results show that pronoun fidelity is not robust, even in a simple, naturalistic setting where humans achieve nearly 100% accuracy.
arXiv Detail & Related papers (2024-04-04T01:07:14Z)
- Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [87.30837365008931]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks.
This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
arXiv Detail & Related papers (2024-01-28T06:50:10Z)
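As a concrete illustration of the step-by-step setup in the entry above, the sketch below contrasts a direct prompt with a zero-shot chain-of-thought prompt on a coreference-style bias probe; the probe sentence and instruction wording are illustrative assumptions, not the study's materials.

```python
# Contrast direct prompting with zero-shot chain-of-thought (CoT) prompting
# on a coreference-style bias probe. The probe sentence and instruction
# wording are illustrative assumptions, not the study's materials.
PROBE = "The developer argued with the designer because ___ was late."

def direct_prompt(probe: str) -> str:
    # Direct answer, no intermediate reasoning requested.
    return f"Fill in the blank: {probe}\nAnswer with a single pronoun."

def cot_prompt(probe: str) -> str:
    # The standard zero-shot CoT trigger asks the model to reason
    # step by step before committing to a pronoun.
    return f"Fill in the blank: {probe}\nLet's think step by step."

if __name__ == "__main__":
    print(direct_prompt(PROBE))
    print(cot_prompt(PROBE))
```

Comparing the pronoun distributions a model produces under the two prompts (e.g., how often it defaults to "he" for "developer") is one way such a study can surface bias shifts.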
- Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies [75.85462924188076]
Gender-inclusive NLP research has documented the harmful limitations of gender-binary-centric large language models (LLMs).
We find that misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization.
We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency (see the tokenization sketch after this entry).
arXiv Detail & Related papers (2023-12-19T01:28:46Z)
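To illustrate the tokenization effect described in the entry above, here is a minimal sketch comparing how a BPE vocabulary splits binary pronouns versus neopronouns. It assumes the `tiktoken` package and the OpenAI `cl100k_base` vocabulary; the pronoun lists are illustrative, not the paper's exact sets.

```python
# Sketch: count BPE tokens for binary pronouns vs. neopronouns.
# Assumes the `tiktoken` package; pronoun lists are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

PRONOUNS = {
    "binary":      ["he", "she", "him", "her", "his", "hers"],
    "neopronouns": ["xe", "xem", "xyr", "ey", "em", "eir", "fae", "faer"],
}

for label, words in PRONOUNS.items():
    # A leading space makes each pronoun tokenize as a standalone word,
    # matching how it appears mid-sentence.
    counts = {w: len(enc.encode(" " + w)) for w in words}
    avg = sum(counts.values()) / len(counts)
    print(f"{label:12s} avg {avg:.2f} tokens/pronoun  {counts}")
```

Pronoun tokenization parity, as proposed in that paper, would aim to equalize these token counts so that neopronouns are not fragmented into several rare subword pieces.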
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- MISGENDERED: Limits of Large Language Models in Understanding Pronouns [46.276320374441056]
We evaluate popular language models for their ability to correctly use English gender-neutral pronouns.
We introduce MISGENDERED, a framework for evaluating large language models' ability to correctly use preferred pronouns.
arXiv Detail & Related papers (2023-06-06T18:27:52Z)
- A Survey on Zero Pronoun Translation [69.09774294082965]
Zero pronouns (ZPs) are frequently omitted in pro-drop languages such as Chinese and Japanese, but must be recalled in non-pro-drop languages such as English.
This survey paper highlights the major works that have been undertaken in zero pronoun translation (ZPT) after the neural revolution.
We uncover a number of insightful findings, such as: 1) ZPT is in line with the development trend of large language models; 2) data limitations cause learning bias across languages and domains; 3) performance improvements are often reported on single benchmarks, but advanced methods are still far from real-world use.
arXiv Detail & Related papers (2023-05-17T13:19:01Z)
- Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution [0.0]
We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations.
Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods.
arXiv Detail & Related papers (2022-09-30T23:10:11Z)