Related papers: Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration

URL: http://arxiv.org/abs/2510.16194v1
Date: Fri, 17 Oct 2025 20:06:31 GMT
Title: Towards Automatic Evaluation and Selection of PHI De-identification Models via Multi-Agent Collaboration
Authors: Guanchen Wu, Zuhui Chen, Yuzhang Xie, Carl Yang
Abstract summary: TEAM-PHI is a multi-agent evaluation and selection framework. It uses large language models (LLMs) to automatically measure de-identification quality. It selects the best-performing model without heavy reliance on gold labels.
Score: 12.912307284471858
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Protected health information (PHI) de-identification is critical for enabling the safe reuse of clinical notes, yet evaluating and comparing PHI de-identification models typically depends on costly, small-scale expert annotations. We present TEAM-PHI, a multi-agent evaluation and selection framework that uses large language models (LLMs) to automatically measure de-identification quality and select the best-performing model without heavy reliance on gold labels. TEAM-PHI deploys multiple Evaluation Agents, each independently judging the correctness of PHI extractions and outputting structured metrics. Their results are then consolidated through an LLM-based majority voting mechanism that integrates diverse evaluator perspectives into a single, stable, and reproducible ranking. Experiments on a real-world clinical note corpus demonstrate that TEAM-PHI produces consistent and accurate rankings: despite variation across individual evaluators, LLM-based voting reliably converges on the same top-performing systems. Further comparison with ground-truth annotations and human evaluation confirms that the framework's automated rankings closely match supervised evaluation. By combining independent evaluation agents with LLM majority voting, TEAM-PHI offers a practical, secure, and cost-effective solution for automatic evaluation and best-model selection in PHI de-identification, even when ground-truth labels are limited.
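
Illustration: the evaluate-then-vote idea in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The paper describes an LLM-based voting step; here it is replaced by a plain plurality vote over each evaluator's top-scoring model. The function names, the per-evaluator (true positive, false positive, false negative) counts, the use of F1 as the structured metric, and the tie-break rule are all hypothetical.

```python
from collections import Counter
from statistics import mean


def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_best_model(evaluator_counts):
    """Pick a model by plurality vote across evaluators (hypothetical helper).

    evaluator_counts maps evaluator name -> {model name: (tp, fp, fn)}.
    Assumes every evaluator judged the same set of models.
    Returns (winning model, Counter of votes per model).
    """
    # Each evaluator independently turns its judgments into one score per model.
    scores = {
        evaluator: {model: f1(*counts) for model, counts in per_model.items()}
        for evaluator, per_model in evaluator_counts.items()
    }
    # Each evaluator casts one vote for the model it scored highest.
    votes = Counter(max(s, key=s.get) for s in scores.values())
    top_votes = max(votes.values())
    tied = [model for model, n in votes.items() if n == top_votes]
    # Assumed tie-break: highest mean F1 across all evaluators.
    winner = max(tied, key=lambda m: mean(s[m] for s in scores.values()))
    return winner, votes


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not results from the paper.
    example = {
        "evaluator_a": {"model_x": (90, 10, 10), "model_y": (80, 20, 20)},
        "evaluator_b": {"model_x": (85, 15, 15), "model_y": (88, 12, 12)},
        "evaluator_c": {"model_x": (92, 8, 8), "model_y": (70, 30, 30)},
    }
    winner, votes = select_best_model(example)
    print(winner, dict(votes))
```

In this sketch the evaluators may disagree individually (evaluator_b prefers a different model), while the vote still settles on a single pick, which mirrors the convergence behavior the abstract reports.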

Related papers

The Effect of Document Summarization on LLM-Based Relevance Judgments [8.796251181920914]
Large Language Models (LLMs) have recently been proposed as automated assessors. We investigate how text summarization affects the reliability of LLM-based judgments. Our findings show that summary-based judgments achieve comparable stability in systems' ranking to full-document judgments.
arXiv Detail & Related papers (2025-12-05T00:26:13Z)
AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment [12.9569411072262]
AutoBench is a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs). This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L.
arXiv Detail & Related papers (2025-10-26T09:20:39Z)
Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling [26.377421806098187]
Large Language Models (LLMs) as automatic evaluators have attracted growing attention. LLMs tend to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework.
arXiv Detail & Related papers (2025-10-09T12:32:31Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
CRACQ: A Multi-Dimensional Approach To Automated Document Assessment [0.0]
CRACQ is a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. It integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis.
arXiv Detail & Related papers (2025-09-26T17:01:54Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Skewed Score: A statistical framework to assess autograders [2.9645858732618238]
"LLM-as-a-judge", or autograders, offer a scalable alternative to human evaluation. They have shown mixed reliability and may exhibit systematic biases. We propose a statistical framework that enables researchers to simultaneously assess their autograders.
arXiv Detail & Related papers (2025-07-04T18:45:10Z)
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced the fact-checking studies. Existing automated fact-checking evaluation methods rely on static datasets and classification metrics. We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
PairBench: Are Vision-Language Models Reliable at Comparing What They See? [16.49586486795478]
We present PairBench, a framework to evaluate large vision language models (VLMs) for automatic evaluation depending on the task. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses.
arXiv Detail & Related papers (2025-02-21T04:53:11Z)
Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.