HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment
- URL: http://arxiv.org/abs/2510.12217v2
- Date: Thu, 16 Oct 2025 08:43:05 GMT
- Title: HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment
- Authors: Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang
- Abstract summary: HALF (Harm-Aware LLM Fairness) is a framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. We show that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
- Score: 52.374772443536045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance does not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
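To make the harm-weighting idea concrete, below is a minimal Python sketch of a harm-severity-weighted fairness aggregation in the spirit of HALF. The tier weights, domain names, and per-domain bias scores are illustrative assumptions for this sketch, not values or APIs from the paper.

  # Minimal sketch: aggregate per-domain bias scores with harm-severity weights.
  # All weights, domains, and scores below are hypothetical placeholders.

  # Hypothetical severity weights: more severe domains count more.
  TIER_WEIGHTS = {"Severe": 3.0, "Moderate": 2.0, "Mild": 1.0}

  # Hypothetical assignment of application domains to harm tiers
  # (the paper uses nine domains; three placeholders are shown here).
  DOMAIN_TIERS = {
      "clinical_decision_support": "Severe",
      "hiring": "Moderate",
      "text_summarization": "Mild",
  }

  def harm_aware_score(bias_by_domain: dict[str, float]) -> float:
      """Combine per-domain bias scores (0 = unbiased, 1 = maximally biased)
      into one harm-weighted score via a weighted average."""
      total_weight = 0.0
      weighted_bias = 0.0
      for domain, bias in bias_by_domain.items():
          weight = TIER_WEIGHTS[DOMAIN_TIERS[domain]]
          weighted_bias += weight * bias
          total_weight += weight
      return weighted_bias / total_weight if total_weight else 0.0

  if __name__ == "__main__":
      # The same amount of bias looks worse when it concentrates in a
      # severe domain than in a mild one.
      model_a = {"clinical_decision_support": 0.30, "hiring": 0.10, "text_summarization": 0.10}
      model_b = {"clinical_decision_support": 0.10, "hiring": 0.10, "text_summarization": 0.30}
      print(f"model A harm-aware bias: {harm_aware_score(model_a):.3f}")
      print(f"model B harm-aware bias: {harm_aware_score(model_b):.3f}")

Under these assumed weights, model A scores worse than model B despite identical average bias, which is the deployment-aligned behavior the framework argues for.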
Related papers
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches [57.5863675268117]
Large Language Models (LLMs) are increasingly employed in high-stakes decision-making tasks, such as loan approvals. We assess the performance and fairness of LLMs on serialized loan approval datasets from Ghana, Germany, and the United States.
arXiv Detail & Related papers (2025-08-29T10:51:41Z) - FairReason: Balancing Reasoning and Social Bias in MLLMs [54.26091556079722]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z) - Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z) - Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness. Observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset. We establish a benchmark using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z) - Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM-as-a-Judge systems in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Few-Shot Fairness: Unveiling LLM's Potential for Fairness-Aware Classification [7.696798306913988]
We introduce a framework outlining fairness regulations aligned with various fairness definitions.
We explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using RAG.
Experiments conducted with different LLMs indicate that GPT-4 delivers superior results in terms of both accuracy and fairness compared to other models.
arXiv Detail & Related papers (2024-02-28T17:29:27Z) - Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment [35.42539816648068]
This paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing large language models (LLMs).
The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.
Results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness.
arXiv Detail & Related papers (2023-08-10T06:43:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.