Agree to Disagree? A Meta-Evaluation of LLM Misgendering
- URL: http://arxiv.org/abs/2504.17075v1
- Date: Wed, 23 Apr 2025 19:52:02 GMT
- Title: Agree to Disagree? A Meta-Evaluation of LLM Misgendering
- Authors: Arjun Subramonian, Vagrant Gautam, Preethi Seshadri, Dietrich Klakow, Kai-Wei Chang, Yizhou Sun
- Abstract summary: We conduct a systematic meta-evaluation of probability- and generation-based evaluation methods for misgendering. By automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels. We also show that misgendering behaviour is complex and goes far beyond pronouns, suggesting essential disagreement with human evaluations.
- Score: 84.77694174309183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
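To make the parallel set-up concrete, the sketch below illustrates one way to run a probability-based check and a generation-based check on the same instance and then compute instance-level agreement between the two verdicts. It is a minimal sketch under assumed choices, not the authors' released code: the model name (gpt2), the template, the pronoun lists, and the heuristic pronoun scan in the generation check are all illustrative placeholders.

```python
# Illustrative sketch only: parallel probability- and generation-based
# misgendering checks on one instance, plus instance-level agreement.
# Model, template, pronoun lists, and heuristics are assumptions, not the
# authors' released evaluation code.

from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder causal LM for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Nominative forms only, for brevity; real evaluations cover full paradigms.
PRONOUN_SETS = {"he": ["he"], "she": ["she"], "they": ["they"]}


@dataclass
class Instance:
    context: str   # introduces the referent and their pronouns
    prompt: str    # templatic sentence ending right before a pronoun slot
    gold_set: str  # key into PRONOUN_SETS


def sequence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()


def probability_based_eval(inst: Instance) -> bool:
    """Correct iff the gold pronoun is the most probable filler for the slot."""
    scores = {
        key: sequence_logprob(f"{inst.context} {inst.prompt} {forms[0]}")
        for key, forms in PRONOUN_SETS.items()
    }
    return max(scores, key=scores.get) == inst.gold_set


def generation_based_eval(inst: Instance) -> bool:
    """Correct iff no generated third-person pronoun falls outside the gold set.

    A crude automatic heuristic; the paper's human evaluation shows that
    misgendering goes well beyond what such pronoun scans capture.
    """
    ids = tokenizer(f"{inst.context} {inst.prompt}", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(out[0, ids.shape[1]:]).lower()
    tokens = continuation.replace(".", " ").replace(",", " ").split()
    all_pronouns = {p for forms in PRONOUN_SETS.values() for p in forms}
    used = {t for t in tokens if t in all_pronouns}
    return used.issubset(set(PRONOUN_SETS[inst.gold_set]))


def instance_level_agreement(instances: list[Instance]) -> float:
    """Fraction of instances on which the two methods give the same verdict."""
    agree = sum(
        probability_based_eval(i) == generation_based_eval(i) for i in instances
    )
    return agree / len(instances)


if __name__ == "__main__":
    demo = [Instance(
        context="Avery is a chef, and Avery uses they/them pronouns.",
        prompt="When the kitchen gets busy,",
        gold_set="they",
    )]
    print(f"Instance-level agreement: {instance_level_agreement(demo):.1%}")
```

Dataset- and model-level comparisons would then follow by aggregating these per-instance verdicts over a full dataset and across models, which is the level at which the paper reports disagreement on 20.2% of instances.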
Related papers
- An Empirical Study of Evaluating Long-form Question Answering [77.8023489322551]
We collect 5,236 factoid and non-factoid long-form answers generated by different large language models.
We conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness.
We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics.
arXiv Detail & Related papers (2025-04-25T15:14:25Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol in which multiple LLMs each independently generate an evaluation on separate criteria; the evaluations are then combined via concatenation.
We show that AIME outperforms baseline methods on code generation tasks, with an up to 62% higher error detection rate and an up to 16% higher success rate than a single-LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z) - How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z) - Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics; a minimal sketch of the WEAT statistic appears after this list.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
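For background on the WEAT statistic mentioned in the social-biases entry above, here is a minimal sketch of the WEAT effect-size computation, assuming embeddings are already available as vectors; the randomly generated embeddings and set sizes below are placeholders for illustration, not vectors drawn from any actual evaluation metric.

```python
# Minimal sketch of the WEAT effect size (word-level association test).
# Embeddings here are random placeholders; a real test would use the word
# lists and embedding space of the metric being probed.

import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def association(w: np.ndarray, A: np.ndarray, B: np.ndarray) -> float:
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return float(np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B]))


def weat_effect_size(X, Y, A, B) -> float:
    """Standardized difference in association between target sets X and Y."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled = np.array(s_x + s_y)
    return (np.mean(s_x) - np.mean(s_y)) / np.std(pooled, ddof=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 50
    # Placeholders standing in for, e.g., career/family target words and
    # gendered attribute words.
    X, Y, A, B = (rng.normal(size=(8, dim)) for _ in range(4))
    print(f"WEAT effect size: {weat_effect_size(X, Y, A, B):.3f}")
```

SEAT applies the same statistic to sentence-level embeddings obtained by inserting the test words into templates.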