Toward Systematic Counterfactual Fairness Evaluation of Large Language Models: The CAFFE Framework
- URL: http://arxiv.org/abs/2512.16816v1
- Date: Thu, 18 Dec 2025 17:56:07 GMT
- Title: Toward Systematic Counterfactual Fairness Evaluation of Large Language Models: The CAFFE Framework
- Authors: Alessandra Parziale, Gianmario Voria, Valeria Pontillo, Gemma Catolino, Andrea De Lucia, Fabio Palomba
- Abstract summary: CAFFE is a structured and intent-aware framework for testing counterfactual fairness in Large Language Models. Inspired by traditional non-functional testing, CAFFE formalizes LLM-Fairness test cases through explicitly defined components.
- Score: 45.94124349318318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are now foundational components of modern software systems. As their influence grows, concerns about fairness have become increasingly pressing. Prior work has proposed metamorphic testing to detect fairness issues, applying input transformations to uncover inconsistencies in model behavior. This paper introduces an alternative perspective for testing counterfactual fairness in LLMs, proposing a structured and intent-aware framework named CAFFE (Counterfactual Assessment Framework for Fairness Evaluation). Inspired by traditional non-functional testing, CAFFE (1) formalizes LLM-Fairness test cases through explicitly defined components, including prompt intent, conversational context, input variants, expected fairness thresholds, and test environment configuration, (2) assists testers by automatically generating targeted test data, and (3) evaluates model responses using semantic similarity metrics. Our experiments, conducted on LLMs from three different architectural families, demonstrate that CAFFE achieves broader bias coverage and more reliable detection of unfair behavior than existing metamorphic approaches.
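To make the test-case formalization concrete, here is a minimal sketch of what such a test case could look like in code. This is an illustration under assumptions, not CAFFE's actual schema or API: the names `FairnessTestCase`, `run_test`, `generate`, and `similarity` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FairnessTestCase:
    """Hypothetical rendering of the components the abstract lists;
    field names are illustrative, not CAFFE's actual schema."""
    prompt_intent: str            # e.g. "career advice", "loan screening"
    context: list                 # prior conversational turns
    input_variants: list          # counterfactual rewrites of the prompt
    similarity_threshold: float   # expected fairness threshold
    environment: dict = field(default_factory=dict)  # model, temperature, ...

def run_test(case, generate, similarity):
    """Generate one response per variant and flag variant pairs whose
    semantic similarity falls below the declared threshold."""
    responses = [generate(case.context, v) for v in case.input_variants]
    failures = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            score = similarity(responses[i], responses[j])
            if score < case.similarity_threshold:
                failures.append((case.input_variants[i],
                                 case.input_variants[j], score))
    return failures  # an empty list means the test case passed
```

A tester would supply `generate` (an LLM API call) and `similarity` (e.g., cosine similarity over sentence embeddings), then assert that `run_test` returns no failures.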
Related papers
- On the use of graph models to achieve individual and group fairness [0.6299766708197883]
We provide a theoretical framework based on Sheaf Diffusion that leverages tools from dynamical systems and homology to model fairness. We present a collection of network topologies handling different fairness metrics, leading to a unified method capable of dealing with both individual and group bias. The paper showcases the performance of the proposed models in terms of accuracy and fairness.
arXiv Detail & Related papers (2026-01-13T18:17:43Z)
- Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
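A minimal sketch of what a distribution-level comparison in this spirit could look like, assuming per-prompt output distributions are available as probability vectors; the metric's exact definition in the paper may differ.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def distributional_alignment(unlearned_dists, reference_dists):
    """Mean symmetric KL between per-prompt output distributions of an
    unlearned model and a reference model trained without the forget set.
    Lower values mean the unlearned model behaves like the reference."""
    scores = [0.5 * (kl(p, q) + kl(q, p))
              for p, q in zip(unlearned_dists, reference_dists)]
    return float(np.mean(scores))
```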
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
- Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective [24.54292750583169]
Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. We propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities.
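A rough sketch of the inter- versus intra-group comparison described here, assuming claim extraction and embedding happen upstream; the Welch t-test and all names are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def _normed(E):
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def intra_similarities(E):
    """Pairwise cosine similarities within one group's claim embeddings."""
    S = _normed(E) @ _normed(E).T
    iu = np.triu_indices_from(S, k=1)  # skip self-similarity on the diagonal
    return S[iu]

def group_fairness_test(emb_g1, emb_g2):
    """emb_g1, emb_g2: (n, d) arrays of claim embeddings extracted from
    responses generated for two demographic groups."""
    intra = np.concatenate([intra_similarities(emb_g1),
                            intra_similarities(emb_g2)])
    inter = (_normed(emb_g1) @ _normed(emb_g2).T).ravel()
    # If inter-group similarity is significantly lower than intra-group
    # similarity, the model treats the two groups differently.
    return stats.ttest_ind(intra, inter, equal_var=False)
```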
arXiv Detail & Related papers (2025-06-23T18:31:22Z)
- Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT [2.380039717474099]
Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs.
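A minimal sketch of one such metamorphic relation: swapping a demographic term in the prompt should leave the response semantically unchanged. `generate` and `similarity` are assumed to be supplied by the tester (an LLM call and an embedding-based similarity); the threshold is illustrative.

```python
def metamorphic_fairness_test(prompt, term, counterpart,
                              generate, similarity, threshold=0.9):
    """Metamorphic relation: responses to a prompt and its demographic
    counterfactual should agree above a similarity threshold."""
    variant_prompt = prompt.replace(term, counterpart)
    original = generate(prompt)
    variant = generate(variant_prompt)
    score = similarity(original, variant)
    return score >= threshold, score  # (passed, evidence)
```

Intersectional bias would be probed the same way, mutating two or more attributes (e.g., gender and ethnicity) in combination.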
arXiv Detail & Related papers (2025-04-04T21:04:14Z)
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
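A hedged guess at the general mechanism, not the paper's exact formulation: score a candidate statement by how little it shifts the semantic embedding of its context, assuming some `embed` function mapping text to a vector.

```python
import numpy as np

def semantic_shift_score(context, statement, embed):
    """Higher cosine similarity means a smaller semantic shift when the
    statement is appended, which (under this sketch) means more plausible."""
    a, b = embed(context), embed(context + " " + statement)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```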
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness, and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
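A minimal sketch of probing stability around a prompt embedding; `generate`, `distance`, and the Gaussian noise model are assumptions. Local perturbations use a small `sigma`; a global probe would sample larger displacements across the embedding space.

```python
import numpy as np

def reliability_under_perturbation(z, generate, distance,
                                   sigma=0.05, trials=16, seed=0):
    """Average output change under small perturbations of embedding z.
    Large changes suggest an unreliable (or bias-injected) region."""
    rng = np.random.default_rng(seed)
    base = generate(z)
    dists = [distance(base, generate(z + sigma * rng.standard_normal(z.shape)))
             for _ in range(trials)]
    return float(np.mean(dists)), float(np.std(dists))
```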
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System [9.470545149911072]
This paper proposes a normative framework to benchmark consumer fairness in LLM-powered recommender systems.
We argue that this gap can lead to arbitrary conclusions about fairness.
Experiments on the MovieLens dataset on consumer fairness reveal fairness deviations in age-based recommendations.
arXiv Detail & Related papers (2024-05-03T16:25:27Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Learning Fair Classifiers via Min-Max F-divergence Regularization [13.81078324883519]
We introduce a novel min-max F-divergence regularization framework for learning fair classification models.
We show that F-divergence measures possess convexity and differentiability properties.
We show that the proposed framework achieves state-of-the-art performance with respect to the trade-off between accuracy and fairness.
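A hedged reconstruction of what a generic min-max f-divergence fairness objective looks like, written from the standard variational (Fenchel-dual) form of the divergence; the paper's exact objective and notation may differ:

```latex
% Sketch: fairness-regularized training; \theta is the classifier, T the
% dual "witness" function, a the protected attribute, f^* the convex
% conjugate of f. Not necessarily the paper's exact formulation.
\min_{\theta}\; \mathcal{L}_{\mathrm{cls}}(\theta)
  + \lambda\, D_f\!\bigl(P_{\hat{y} \mid a=0} \,\|\, P_{\hat{y} \mid a=1}\bigr),
\qquad
D_f(P \,\|\, Q) = \sup_{T}\; \mathbb{E}_{P}[T] - \mathbb{E}_{Q}\bigl[f^{*}(T)\bigr].
```

Substituting the supremum into the loss turns training into a min-max game between the classifier parameters and the witness T, which is where the convexity and differentiability of f-divergence measures become important.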
arXiv Detail & Related papers (2023-06-28T20:42:04Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
Accumulated Prediction Sensitivity measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
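A minimal finite-difference sketch of accumulating prediction sensitivity across input features; the `weights` vector tying features to a protected attribute and all names are assumptions, and the paper's gradient-based definition may differ in detail.

```python
import numpy as np

def prediction_sensitivity(model, x, weights, eps=1e-4):
    """Weighted sum of |d model(x) / d x_i|, approximated by central
    differences; model(x) returns a scalar prediction probability and
    weights[i] encodes feature i's relevance to a protected attribute."""
    grads = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grads[i] = (model(xp) - model(xm)) / (2 * eps)
    return float(np.sum(weights * np.abs(grads)))
```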
arXiv Detail & Related papers (2022-03-16T15:00:33Z)