Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs
- URL: http://arxiv.org/abs/2601.03087v1
- Date: Tue, 06 Jan 2026 15:22:23 GMT
- Title: Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs
- Authors: David Hartmann, Lena Pohlmann, Lelia Hanslik, Noah Gießing, Bettina Berendt, Pieter Delobelle
- Abstract summary: Large Language Models (LLMs) exhibit systematic biases across demographic groups. We conceptualise auditing as uncertainty estimation over a target fairness metric. We introduce BAFA, the Bounded Active Fairness Auditor, for query-efficient auditing of black-box LLMs.
- Score: 4.673176641454931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit systematic biases across demographic groups. Auditing has been proposed as an accountability tool for black-box LLM applications, but it suffers from resource-intensive query access. We conceptualise auditing as uncertainty estimation over a target fairness metric and introduce BAFA, the Bounded Active Fairness Auditor, for query-efficient auditing of black-box LLMs. BAFA maintains a version space of surrogate models consistent with the queried scores and computes uncertainty intervals for fairness metrics (e.g., ΔAUC) via constrained empirical risk minimisation. Active query selection narrows these intervals to reduce estimation error. We evaluate BAFA on two standard fairness-dataset case studies, CivilComments and Bias-in-Bios, comparing against stratified sampling, power sampling, and ablations. At tight error thresholds, BAFA reaches the target with up to 40× fewer queries than stratified sampling (e.g., 144 vs. 5,956 queries at ε = 0.02 on CivilComments), performs substantially better over time, and shows lower variance across runs. These results suggest that active sampling can reduce the resources needed for independent fairness auditing of LLMs, supporting continuous model evaluation.
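The interval-narrowing idea in the abstract can be illustrated with a toy sketch. The code below is an assumption-laden simplification, not BAFA itself: instead of a version space of surrogates and constrained ERM, it keeps a Hoeffding confidence interval on each group's accuracy and always spends the next query on the group contributing the most uncertainty to the gap. All names (`audit_delta_accuracy`, `pool_by_group`) are hypothetical.

```python
import math
import random


def audit_delta_accuracy(model, pool_by_group, eps=0.05, delta=0.05, max_queries=2000):
    """Actively estimate the accuracy gap between two demographic groups.

    Toy sketch (NOT the paper's method): keep a Hoeffding half-width per
    group, query the group whose interval is widest, and stop once the
    interval for the gap is narrower than 2*eps.
    """
    stats = {g: [0, 0] for g in pool_by_group}  # group -> [correct, queried]

    def half_width(g):
        n = stats[g][1]
        if n == 0:
            return 1.0
        return math.sqrt(math.log(2 / delta) / (2 * n))

    for _ in range(max_queries):
        # Query the group currently contributing most uncertainty to the gap.
        g = max(stats, key=half_width)
        x, y = random.choice(pool_by_group[g])
        stats[g][0] += int(model(x) == y)
        stats[g][1] += 1
        if sum(half_width(g2) for g2 in stats) <= 2 * eps:
            break

    acc = {g: c / max(n, 1) for g, (c, n) in stats.items()}
    groups = sorted(acc)
    gap = acc[groups[0]] - acc[groups[1]]
    width = sum(half_width(g) for g in stats)
    return gap, (gap - width, gap + width)
```

The stopping rule makes the query count adaptive: a model with a clear-cut gap is certified in far fewer queries than uniform stratified sampling would use.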
Related papers
- LLM-as-Judge on a Budget [35.393598355979385]
We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^{K}\sigma_i^2}{B}}\right)$. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation.
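A minimal sketch of the variance-adaptive allocation idea this summary describes, under stated assumptions: after a short warmup, each remaining judge call goes to the item whose empirical confidence width is currently largest, so noisy items receive more of the budget. `sample_score` is a hypothetical stub for one LLM-judge call; this is an illustration, not the paper's algorithm.

```python
import math
import statistics


def variance_adaptive_means(sample_score, K, budget, warmup=5):
    """Estimate per-item mean scores under a fixed query budget.

    Sketch: warm up uniformly, then greedily query the item with the
    largest empirical width std/sqrt(n), a crude stand-in for a
    concentration-inequality-driven allocation rule.
    """
    scores = [[] for _ in range(K)]
    for i in range(K):
        for _ in range(warmup):
            scores[i].append(sample_score(i))
    spent = K * warmup
    while spent < budget:
        width = [statistics.pstdev(s) / math.sqrt(len(s)) for s in scores]
        i = max(range(K), key=lambda j: width[j])  # noisiest item next
        scores[i].append(sample_score(i))
        spent += 1
    return [statistics.fmean(s) for s in scores], [len(s) for s in scores]
```

An item whose scores are constant gets no budget beyond warmup, which is exactly the behaviour that beats uniform allocation when per-item variances differ.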
arXiv Detail & Related papers (2026-02-17T10:35:41Z) - Efficient Evaluation of LLM Performance with Statistical Guarantees [11.703733256169214]
We propose Factorized Active Querying (FAQ) for benchmarking large language models. FAQ adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy. FAQ delivers up to 5× effective-sample-size gains over strong baselines on two benchmark suites.
arXiv Detail & Related papers (2026-01-28T04:59:20Z) - BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses [32.58830706120845]
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance. We introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
arXiv Detail & Related papers (2025-09-30T19:56:54Z) - Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data [0.3593955557310285]
We investigate a self-supervised approach for estimating uncertainty from single-shot outputs using token-level features. We show that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, supporting a cost-effective integration of uncertainty measures into Entity Linking.
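One common token-level feature of the kind this summary refers to is the length-normalised sequence probability. The sketch below is a hedged illustration of that single feature, not the paper's full self-supervised feature set; `sequence_confidence` and `flag_uncertain` are hypothetical names.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability from single-shot log-probs.

    exp(mean log-prob) is a cheap confidence score that needs no extra
    model calls; low values tend to correlate with low-accuracy outputs.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def flag_uncertain(outputs, threshold=0.5):
    """Return answers whose confidence falls below the threshold.

    `outputs` is a list of (answer, token_logprobs) pairs.
    """
    return [a for a, lps in outputs if sequence_confidence(lps) < threshold]
```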
arXiv Detail & Related papers (2025-09-24T10:44:16Z) - Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling [59.133428586090226]
Large language models (LLMs) can often accurately describe probability distributions in natural language, yet struggle to sample faithfully from those same distributions. This mismatch limits their use in tasks requiring reliable randomness, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling.
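For reference, the classical rejection sampling scheme that VRS adapts can be sketched as follows: accept a proposed value x with probability target(x) / (M · proposal(x)), where M bounds the density ratio. Here `propose()` stands in for the LLM's biased sampler; the verbalized, natural-language form used by VRS is not reproduced.

```python
import random


def rejection_sample(propose, target_pmf, proposal_pmf, M, rng=random):
    """Draw one sample from target_pmf using a (possibly biased) proposer.

    Classical rejection sampling: keep proposing until a draw is accepted
    with probability target_pmf(x) / (M * proposal_pmf(x)).
    """
    while True:
        x = propose()
        if rng.random() < target_pmf(x) / (M * proposal_pmf(x)):
            return x
```

With a coin that lands heads 70% of the time, choosing M = 0.5/0.3 makes the accepted stream an unbiased fair coin, mirroring the coin-flip debiasing in the title.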
arXiv Detail & Related papers (2025-06-11T17:59:58Z) - Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness. Observing data-size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset. We establish a benchmark using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? [20.998805709422292]
Test collections are information-retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. As a cheaper alternative, recent studies have proposed using large language models (LLMs) to completely replace human assessors. We propose LARA, an effective method to balance manual annotations with LLM annotations, helping build a rich and reliable test collection even under a low budget.
arXiv Detail & Related papers (2024-11-11T11:17:35Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG).
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM judges in many domains, their potential issues are under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - Unlocking the Power of LLM Uncertainty for Active In-Context Example Selection [6.813733517894384]
The Uncertainty Tripartite Testing Paradigm (Unc-TTP) is a novel method for classifying the uncertainty of Large Language Models (LLMs). Unc-TTP performs three rounds of sampling under varying label-injection interference, enumerating all possible outcomes. Our experiments show that uncertain examples selected via Unc-TTP are more informative than certain examples.
arXiv Detail & Related papers (2024-08-17T11:33:23Z) - Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study [19.461541208547136]
This paper studies the impact of increasing the number of in-context examples on the consistency and quality of evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot and few-shot regimes.
arXiv Detail & Related papers (2024-06-17T15:11:58Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces the sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
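The dynamic per-question sample count described in the Adaptive-Consistency entry can be sketched as follows. This is a hedged simplification: the stopping test below is a crude smoothed-share rule on the current majority answer, not the paper's Dirichlet-based criterion, and `sample_answer` is an assumed stub for one chain-of-thought sample.

```python
from collections import Counter


def adaptive_consistency(sample_answer, max_samples=40, min_samples=3, conf=0.95):
    """Self-consistency with early stopping (simplified).

    Draw reasoning samples one at a time; stop as soon as the current
    majority answer's smoothed share exceeds `conf`, so easy questions
    consume far fewer samples than a fixed self-consistency budget.
    """
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        if n >= min_samples:
            top = counts.most_common(1)[0][1]
            # Crude stability check with a +1 smoothing prior.
            if (top + 1) / (n + 2) >= conf:
                break
    return counts.most_common(1)[0][0], n
```

On a question where every sample agrees, this stops well short of the 40-sample cap, which is the source of the budget reduction the abstract reports.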
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.