From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
- URL: http://arxiv.org/abs/2507.17717v1
- Date: Wed, 23 Jul 2025 17:28:31 GMT
- Title: From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
- Authors: Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan
- Abstract summary: We propose a pipeline that distills real user feedback into structured checklists for note evaluation. Using deidentified data from over 21,000 clinical encounters, we show that our feedback-derived checklist outperforms baseline approaches. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
- Score: 26.750112195124284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to the high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters from a deployed AI medical scribe system, prepared in accordance with the HIPAA Safe Harbor standard, we show that our feedback-derived checklist outperforms baseline approaches in our offline evaluations in terms of coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
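The paper itself does not ship code, but the checklist-enforcement idea can be sketched in a few lines. Everything below (the example checklist items, the `call_llm` stub, and the 0.8 quality threshold) is an illustrative assumption rather than the authors' actual pipeline; it only shows the general shape of an LLM evaluator that scores a note against feedback-derived checklist items and flags notes likely to fall below a chosen threshold.

```python
# Minimal sketch of checklist-based note evaluation with an LLM judge.
# The checklist items, call_llm() stub, and threshold are illustrative
# assumptions; the paper derives its checklist from real user feedback.

CHECKLIST = [
    "Does the note state the chief complaint in the patient's own terms?",
    "Are all medications discussed in the encounter captured with dosages?",
    "Is the assessment free of information not supported by the transcript?",
]

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat LLM; expected to return 'yes' or 'no'."""
    raise NotImplementedError("plug in your own model client here")

def evaluate_note(note: str, transcript: str) -> float:
    """Return the fraction of checklist items the note satisfies."""
    passed = 0
    for item in CHECKLIST:
        prompt = (
            "You are auditing an AI-generated clinical note.\n"
            f"Transcript:\n{transcript}\n\nNote:\n{note}\n\n"
            f"Question: {item}\nAnswer strictly 'yes' or 'no'."
        )
        if call_llm(prompt).strip().lower().startswith("yes"):
            passed += 1
    return passed / len(CHECKLIST)

def flag_low_quality(note: str, transcript: str, threshold: float = 0.8) -> bool:
    # Notes scoring below the chosen threshold are routed for human review.
    return evaluate_note(note, transcript) < threshold
```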
Related papers
- Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models [46.81512544528928]
We introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness.
arXiv Detail & Related papers (2025-08-06T11:11:40Z)
- Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench [0.0]
HealthBench is a benchmark designed to better measure the capabilities of AI systems for health. Its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies. We propose anchoring reward functions in version-controlled Clinical Practice Guidelines that incorporate systematic reviews and GRADE evidence ratings.
arXiv Detail & Related papers (2025-07-31T18:16:10Z)
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
- AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
We present AutoMedEval, an open-source automatic evaluation model with 13B parameters, specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation.
arXiv Detail & Related papers (2025-05-17T07:44:54Z)
- TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes [3.9806397855028983]
Quality standards for behavioral therapy notes remain underdeveloped. A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. In a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.
arXiv Detail & Related papers (2025-03-26T15:40:40Z)
- Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation [31.061600616994145]
HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
arXiv Detail & Related papers (2025-01-12T07:30:49Z)
- DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation [37.58514130165496]
We propose a set of metrics to evaluate the completeness, conciseness, and attribution of generated medical text.
The metrics can be computed by various types of evaluators, including instruction-following models (both proprietary and open-source) and supervised entailment models.
A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics.
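As a rough illustration of how entailment-based metrics of this kind can be scored (this is not the DocLens implementation; the `entails` stub and the naive sentence-level claim splitting below are assumptions), completeness and attribution can be framed as directed entailment checks between claims and source text:

```python
# Illustrative scoring of completeness and attribution via entailment checks.
# entails() is a stub: in practice it would wrap an NLI model or an
# instruction-following LLM judge.

def entails(premise: str, hypothesis: str) -> bool:
    raise NotImplementedError("plug in an NLI model or LLM judge here")

def split_claims(text: str) -> list[str]:
    # Naive sentence split as a stand-in for proper claim extraction.
    return [s.strip() for s in text.split(".") if s.strip()]

def completeness(generated_note: str, reference_note: str) -> float:
    # Fraction of reference claims that the generated note supports.
    claims = split_claims(reference_note)
    if not claims:
        return 1.0
    return sum(entails(generated_note, c) for c in claims) / len(claims)

def attribution(generated_note: str, source_transcript: str) -> float:
    # Fraction of generated claims that are supported by the source.
    claims = split_claims(generated_note)
    if not claims:
        return 1.0
    return sum(entails(source_transcript, c) for c in claims) / len(claims)
```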
arXiv Detail & Related papers (2023-11-16T05:32:09Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care [45.43645878061283]
We present a framework that performs knowledge graph construction from raw GP medical notes written during or after patient consultations.
Our knowledge graphs include information about existing patient symptoms, their duration, and their severity.
We apply our framework to consultation notes of COVID-19 patients in the UK.
arXiv Detail & Related papers (2023-06-17T23:35:51Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics such as BertScore.
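For context on that character-based baseline, a minimal sketch of a normalized Levenshtein similarity between a generated note and a clinician-written reference is shown below; the function names and example strings are illustrative, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(generated: str, reference: str) -> float:
    # Normalize to [0, 1]; 1.0 means the notes are identical.
    if not generated and not reference:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / max(len(generated), len(reference))

print(similarity("Patient reports mild headache.", "Patient reports a mild headache."))
```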
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- A Methodology for Bi-Directional Knowledge-Based Assessment of Compliance to Continuous Application of Clinical Guidelines [1.52292571922932]
We introduce a new approach for automated guideline-based quality assessment of the care process.
The BiKBAC method assesses the degree of compliance when applying clinical guidelines.
The DiscovErr system was evaluated in a separate study in the type 2 diabetes management domain.
arXiv Detail & Related papers (2021-03-13T20:43:45Z)