Related papers: Textual Entailment and Token Probability as Bias Evaluation Metrics

Textual Entailment and Token Probability as Bias Evaluation Metrics

URL: http://arxiv.org/abs/2510.07662v1
Date: Thu, 09 Oct 2025 01:30:35 GMT
Title: Textual Entailment and Token Probability as Bias Evaluation Metrics
Authors: Virginia K. Felkner, Allison Lim, Jonathan May,
Abstract summary: We test natural language inference (NLI) as a more realistic alternative bias metric.<n>We find that NLI metrics are more likely to detect "underdebiased" cases.<n>We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases.
Score: 27.54174592324523
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world langugage model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.

Related papers

BiasConnect: Investigating Bias Interactions in Text-to-Image Models [73.76853483463836]
We introduce BiasConnect, a novel tool designed to analyze and quantify bias interactions in Text-to-Image models.<n>Our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified.<n>We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.
arXiv Detail & Related papers (2025-03-12T19:01:41Z)
Analyzing Correlations Between Intrinsic and Extrinsic Bias Metrics of Static Word Embeddings With Their Measuring Biases Aligned [8.673018064714547]
We examine the abilities of intrinsic bias metrics of static word embeddings to predict whether Natural Language Processing (NLP) systems exhibit biased behavior. A word embedding is one of the fundamental NLP technologies that represents the meanings of words through real vectors, and problematically, it also learns social biases such as stereotypes.
arXiv Detail & Related papers (2024-09-14T02:13:56Z)
Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between the user attributes stated or implied by a prompt.<n>We develop analogous RUTEd evaluations from three contexts of real-world use: children's bedtime stories, user personas, and English language learning exercises.<n>We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community. The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability. We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
Evaluating Gender Bias of Pre-trained Language Models in Natural Language Inference by Considering All Labels [38.1620443730172]
Discriminatory gender biases have been found in Pre-trained Language Models (PLMs) for multiple languages. We propose a bias evaluation method for PLMs, called NLI-CoAL, which considers all the three labels of Natural Language Inference. We create datasets in English, Japanese, and Chinese, and successfully validate our bias measure across multiple languages.
arXiv Detail & Related papers (2023-09-18T12:02:21Z)
Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations. We propose to utilize textitmultiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings. We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z)
Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on the ideas from previous works and derive conditions for bias metrics. We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
Intrinsic Bias Metrics Do Not Correlate with Application Bias [12.588713044749179]
This research examines whether easy-to-measure intrinsic metrics correlate well to real world extrinsic metrics. We measure both intrinsic and extrinsic bias across hundreds of trained models covering different tasks and experimental conditions. We advise that efforts to debias embedding spaces be always also paired with measurement of downstream model bias, and suggest that that community increase effort into making downstream measurement more feasible via creation of additional challenge sets and annotated test data.
arXiv Detail & Related papers (2020-12-31T18:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.