Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation
- URL: http://arxiv.org/abs/2504.03716v1
- Date: Sat, 29 Mar 2025 04:36:25 GMT
- Title: Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation
- Authors: Hannah Murray, Brian Hyeongseok Kim, Isabelle Lee, Jason Byun, Dani Yogatama, Evi Micha,
- Abstract summary: Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases.
- Score: 19.66750942418172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are becoming ubiquitous, promising automation even in high-stakes scenarios. However, existing evaluation methods often fall short -- benchmarks saturate, accuracy-based metrics are overly simplistic, and many inherently ambiguous problems lack a clear ground truth. Given these limitations, evaluating fairness becomes complex. To address this, we reframe fairness evaluation using Borda scores, a method from voting theory, as a nuanced yet interpretable metric for measuring fairness. Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate for a kidney, and we assess fairness across demographics using proportional parity. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases. Our findings highlight the potential of voting-based metrics to provide a richer, more multifaceted evaluation of LLM fairness.
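As a rough, hedged illustration of the two evaluation settings described in the abstract, the sketch below computes proportional parity over Choose-One winners and group-averaged Borda scores over Rank-All rankings. The candidate IDs, demographic labels, and function names are hypothetical placeholders; the paper's own scoring conventions and implementation may differ.

```python
# Minimal sketch (not the authors' released code) of the two fairness checks
# named in the abstract: proportional parity for Choose-One and group-level
# Borda scores for Rank-All.
from collections import defaultdict

def proportional_parity(selections, groups):
    """Each group's share of Choose-One wins divided by its share of the candidate pool."""
    pool = defaultdict(int)
    wins = defaultdict(int)
    for cand, grp in groups.items():
        pool[grp] += 1
    for cand in selections:  # one winning candidate per Choose-One trial
        wins[groups[cand]] += 1
    total_pool, total_wins = sum(pool.values()), len(selections)
    return {g: (wins[g] / total_wins) / (pool[g] / total_pool) for g in pool}

def borda_by_group(rankings, groups):
    """Average Borda score per demographic group over many Rank-All trials.
    The candidate in position i (0-indexed, best first) of an n-way ranking gets n - 1 - i points."""
    points = defaultdict(list)
    for ranking in rankings:  # ranking: list of candidate ids, best first
        n = len(ranking)
        for i, cand in enumerate(ranking):
            points[groups[cand]].append(n - 1 - i)
    return {g: sum(v) / len(v) for g, v in points.items()}

# Toy data: four hypothetical candidates with made-up demographic labels.
groups = {"A": "group1", "B": "group1", "C": "group2", "D": "group2"}
choose_one_winners = ["A", "A", "C", "B"]                # winners of four Choose-One trials
rank_all_outputs = [["A", "C", "B", "D"], ["C", "A", "D", "B"]]
print(proportional_parity(choose_one_winners, groups))   # 1.0 would mean wins track pool share
print(borda_by_group(rank_all_outputs, groups))          # large gaps between group means suggest rank bias
```

The point of the ranking-based view, per the abstract, is that a winner-only parity check ignores how candidates are ordered below the top slot, whereas comparing group-averaged Borda scores surfaces systematic rank-order bias.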
Related papers
- Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. We propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications.
arXiv Detail & Related papers (2025-04-01T09:36:56Z)
- EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
arXiv Detail & Related papers (2024-12-31T03:56:17Z)
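The EQUATOR entry above describes pairing open-ended questions with human-vetted answers in a vector database and scoring deterministically. The sketch below captures that idea only in broad strokes; the toy embedding function, the similarity threshold, and the two-entry reference store are all assumptions for illustration, not EQUATOR's actual components.

```python
# Hedged sketch of a deterministic, retrieval-backed check: look up the stored
# human-evaluated answer nearest to the question, then score the model answer
# against it with a fixed similarity threshold instead of a free-form judge.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: normalized bag-of-characters counts."""
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Hypothetical reference store: questions paired with human-evaluated answers.
reference = {
    "Why does ice float on water?": "Ice is less dense than liquid water.",
    "What causes seasons on Earth?": "The tilt of Earth's rotational axis.",
}
keys = list(reference)
key_vecs = np.stack([embed(q) for q in keys])

def deterministic_score(question: str, model_answer: str, threshold: float = 0.75) -> int:
    """Return 1 if the model answer is close enough to the stored reference, else 0."""
    q_vec = embed(question)
    ref_answer = reference[keys[int(np.argmax(key_vecs @ q_vec))]]  # nearest stored question
    similarity = float(embed(model_answer) @ embed(ref_answer))     # cosine of unit vectors
    return int(similarity >= threshold)

print(deterministic_score("Why does ice float on water?",
                          "Because ice is less dense than water."))
```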
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge [32.55871325700294]
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm.
arXiv Detail & Related papers (2024-11-25T17:28:44Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents.
In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. Yet how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system.
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
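The MAD entry above aggregates pairwise judge outcomes into a global ranking with the Elo rating system. Below is a generic Elo update over hypothetical comparison records, not the paper's exact configuration; the K-factor of 32 and the starting rating of 1000 are illustrative defaults.

```python
# Generic Elo aggregation of pairwise LLM comparisons (illustrative defaults,
# not the MAD paper's exact settings).
from collections import defaultdict

def elo_ranking(comparisons, k=32.0, start=1000.0):
    """comparisons: iterable of (winner, loser) model-name pairs, processed in order."""
    rating = defaultdict(lambda: start)
    for winner, loser in comparisons:
        # Expected score of the winner under the standard logistic Elo model.
        expected_win = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400.0))
        rating[winner] += k * (1.0 - expected_win)   # winner gains the "missed" expectation
        rating[loser] -= k * (1.0 - expected_win)    # symmetric loss for the loser
    return sorted(rating.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical judge outcomes on a handful of instructions.
results = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b")]
for name, score in elo_ranking(results):
    print(f"{name}: {score:.1f}")
```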
- Ranking Large Language Models without Ground Truth [24.751931637152524]
Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models.
We provide a novel perspective where, given a dataset of prompts, we rank the models without access to any ground truth or reference responses.
Applying this idea repeatedly, we propose two methods to rank LLMs.
arXiv Detail & Related papers (2024-02-21T00:49:43Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically. We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs. We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)