Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
- URL: http://arxiv.org/abs/2506.19028v4
- Date: Fri, 29 Aug 2025 16:20:31 GMT
- Title: Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
- Authors: Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
- Abstract summary: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. We propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities.
- Score: 24.54292750583169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
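For readers who want a concrete picture of the inter- vs. intra-group comparison the abstract describes, here is a minimal sketch. It is not the authors' released pipeline: the `toy_similarity` Jaccard overlap stands in for claim decomposition plus entailment checking, Welch's t-test stands in for the paper's hypothesis test, and the example responses are invented.

```python
# Minimal sketch of a FiSCo-style group fairness check; not the authors' released code.
# Assumptions: Jaccard token overlap (`toy_similarity`) stands in for claim decomposition
# plus entailment checking, and Welch's t-test stands in for the paper's hypothesis test.
from itertools import combinations, product
from scipy import stats

def toy_similarity(resp_a: str, resp_b: str) -> float:
    """Stand-in for claim-level entailment similarity: token Jaccard overlap."""
    a, b = set(resp_a.lower().split()), set(resp_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def group_fairness_test(group_x: list[str], group_y: list[str], alpha: float = 0.05) -> dict:
    """Compare intra-group vs. inter-group response similarities with Welch's t-test."""
    intra = [toy_similarity(a, b) for g in (group_x, group_y) for a, b in combinations(g, 2)]
    inter = [toy_similarity(a, b) for a, b in product(group_x, group_y)]
    t_stat, p_value = stats.ttest_ind(intra, inter, equal_var=False)
    gap = sum(intra) / len(intra) - sum(inter) / len(inter)
    return {"t": t_stat, "p": p_value, "similarity_gap": gap, "flagged": p_value < alpha and gap > 0}

# Hypothetical responses to the same advice prompt with swapped demographic cues.
group_a = ["Negotiate for a higher salary.", "Ask for a raise and cite your impact.", "Push for a leadership role."]
group_b = ["Be grateful for the offer.", "Accept the offer; stability matters.", "Avoid rocking the boat."]
print(group_fairness_test(group_a, group_b))
```

A group is flagged only when inter-group similarity is significantly lower than intra-group similarity, which mirrors the group counterfactual intuition: responses should vary no more across demographic groups than they do within one.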
Related papers
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses [32.58830706120845]
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance. We introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
arXiv Detail & Related papers (2025-09-30T19:56:54Z) - On Robustness and Reliability of Benchmark-Based Evaluation of LLMs [6.121856629864516]
The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag. Real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. We systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities.
arXiv Detail & Related papers (2025-09-04T08:43:27Z) - Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models [15.287990843387382]
Development of new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images.
arXiv Detail & Related papers (2025-08-25T05:14:15Z) - When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs [4.265831047857601]
We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings.
arXiv Detail & Related papers (2025-06-11T18:26:32Z) - Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
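A minimal sketch of the general idea, with simulated similarity scores in place of real LLM generations and embeddings; the two-sample Kolmogorov-Smirnov test is an assumption made here, not necessarily the paper's test statistic.

```python
# Illustrative sketch of a distribution-based perturbation test; not the paper's implementation.
# Assumptions: similarity scores are simulated with numpy rather than computed from real
# LLM samples, and a KS test stands in for the paper's frequentist procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Null: similarities between repeated generations from the SAME prompt.
null_similarities = rng.normal(loc=0.85, scale=0.05, size=200)
# Alternative: similarities between generations from original vs. perturbed prompts.
perturbed_similarities = rng.normal(loc=0.78, scale=0.06, size=200)

statistic, p_value = stats.ks_2samp(null_similarities, perturbed_similarities)
verdict = "perturbation changed the output distribution" if p_value < 0.05 else "no detectable change"
print(f"KS={statistic:.3f}, p={p_value:.4f} -> {verdict}")
```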
arXiv Detail & Related papers (2025-06-09T17:11:07Z) - Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness [49.35494016290887]
We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of relevant populations but reflective of real-world disparities. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift.
arXiv Detail & Related papers (2025-06-04T17:40:31Z) - Relative Bias: A Comparative Framework for Quantifying Bias in LLMs [29.112649816695203]
Relative Bias is a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios, followed by statistical tests for validation, we find strong alignment between the two scoring methods.
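As a rough analogue of comparing one model's outputs against others in embedding space (not the paper's Embedding Transformation method), the sketch below uses TF-IDF vectors as a stand-in for sentence embeddings and scores each model by its cosine distance from the consensus centroid; the model names and outputs are invented.

```python
# Rough analogue of a "relative bias" comparison across models; not the paper's method.
# Assumptions: TF-IDF vectors stand in for learned sentence embeddings, and deviation is
# measured as cosine distance from the centroid of all models' mean output vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical outputs from three models answering the same domain prompts.
model_outputs = {
    "model_a": ["The candidate is highly qualified.", "She deserves a leadership role."],
    "model_b": ["The candidate is qualified.", "He may suit a supporting role."],
    "model_c": ["The candidate is highly qualified.", "They deserve a leadership role."],
}

all_texts = [text for outs in model_outputs.values() for text in outs]
vectorizer = TfidfVectorizer().fit(all_texts)

mean_vectors = {name: np.asarray(vectorizer.transform(outs).mean(axis=0))
                for name, outs in model_outputs.items()}
centroid = np.mean(list(mean_vectors.values()), axis=0)

for name, vec in mean_vectors.items():
    deviation = 1.0 - cosine_similarity(vec, centroid)[0, 0]
    print(f"{name}: relative deviation from model consensus = {deviation:.3f}")
```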
arXiv Detail & Related papers (2025-05-22T01:59:54Z) - Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT [2.380039717474099]
Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs.
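A toy illustration of the metamorphic-testing idea, under the assumption that swapping gendered pronouns should leave a decision unchanged; `query_model` is a hypothetical placeholder for a real LLM call and is not part of the paper.

```python
# Toy metamorphic fairness test; not the paper's framework.
# Metamorphic relation assumed here: swapping gendered pronouns must not flip the decision.
import re

PRONOUN_SWAPS = {"he": "she", "him": "her", "his": "her"}

def mutate(prompt: str) -> str:
    """Apply the metamorphic transformation: swap gendered pronouns (word-boundary aware)."""
    pattern = r"\b(" + "|".join(PRONOUN_SWAPS) + r")\b"
    return re.sub(pattern, lambda m: PRONOUN_SWAPS[m.group(0)], prompt)

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned decision for the demo."""
    return "Recommend for promotion" if "exceeds targets" in prompt else "Do not recommend"

def fairness_bug_found(prompt: str) -> bool:
    """The metamorphic test fails (bug found) when the decision flips under the swap."""
    return query_model(prompt) != query_model(mutate(prompt))

prompt = "Alex exceeds targets every quarter, and he asks for a raise; should his manager promote him?"
print("fairness bug found:", fairness_bug_found(prompt))
```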
arXiv Detail & Related papers (2025-04-04T21:04:14Z) - FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking research. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
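A crude stand-in for that idea: sample several explanation-answer pairs for one question and use the frequency of the modal answer as a stability-based confidence score. This is an assumption-laden simplification, not the paper's estimator, and the sampled pairs below are invented.

```python
# Crude stand-in for explanation-stability confidence; not the paper's estimator.
# Assumption: several (explanation, answer) samples already exist for one question;
# confidence is approximated by the frequency of the modal answer across samples.
from collections import Counter

samples = [
    ("Paris is the capital and largest city of France.", "Paris"),
    ("France's seat of government is in Paris.", "Paris"),
    ("Some sources mention Lyon, but the capital is Paris.", "Paris"),
    ("The capital could be Lyon given its size.", "Lyon"),
]

answer_counts = Counter(answer for _, answer in samples)
top_answer, count = answer_counts.most_common(1)[0]
confidence = count / len(samples)
print(f"answer={top_answer}, stability-based confidence={confidence:.2f}")
```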
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
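RSA itself is straightforward to reproduce in a few lines; the sketch below correlates the pairwise-similarity structure of two vector spaces, using random matrices as stand-ins for the static DSM and the averaged BERT vectors mentioned above.

```python
# Minimal sketch of Representational Similarity Analysis (RSA) between two vector spaces.
# Assumption: random vectors stand in for word embeddings from two distributional models;
# RSA here means Spearman correlation between the models' pairwise cosine-distance structures.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_words, dim_a, dim_b = 50, 300, 768

space_a = rng.standard_normal((n_words, dim_a))   # e.g., a static DSM
space_b = rng.standard_normal((n_words, dim_b))   # e.g., averaged BERT vectors

# Condensed pairwise cosine-distance vectors play the role of representational
# dissimilarity matrices (RDMs); RSA correlates them across the two spaces.
rdm_a = pdist(space_a, metric="cosine")
rdm_b = pdist(space_b, metric="cosine")

rho, p_value = spearmanr(rdm_a, rdm_b)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p_value:.3g}")
```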
arXiv Detail & Related papers (2021-05-20T15:18:06Z)