Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
- URL: http://arxiv.org/abs/2510.00436v1
- Date: Wed, 01 Oct 2025 02:39:37 GMT
- Title: Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
- Authors: Sarvesh Soni, Dina Demner-Fushman,
- Abstract summary: Current gold standard for evaluating AI responses is labor-intensive and slow.<n>We conducted a large systematic study of evaluation approaches.<n>Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems.
- Score: 8.450904497835262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses--human expert review--is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
Related papers
- Toward an AI Reasoning-Enabled System for Patient-Clinical Trial Matching [0.0]
We present a secure, scalable proof-of-concept system for Artificial Intelligence (AI)-augmented patient-trial matching.<n>The system generates structured eligibility assessments with interpretable reasoning chains that support human-in-the-loop review.<n>The system aims to reduce coordinator burden, intelligently broaden the set of trials considered for each patient and guarantee comprehensive auditability of all AI-generated outputs.
arXiv Detail & Related papers (2025-12-08T20:35:51Z) - AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
We present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs.<n>The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation.
arXiv Detail & Related papers (2025-05-17T07:44:54Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - ASTRID -- An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems [0.0]
Large Language Models (LLMs) have shown impressive potential in clinical question answering.<n>RAG is emerging as a leading approach for ensuring the factual accuracy of model responses.<n>Current automated RAG metrics perform poorly in clinical and conversational use cases.
arXiv Detail & Related papers (2025-01-14T15:46:39Z) - Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation [2.7379431425414693]
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems.
arXiv Detail & Related papers (2024-09-03T14:38:29Z) - Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems [0.7124736158080937]
We develop an evaluation paradigm that centers human understanding and decision-making.<n>We study the utility of generative AI systems in supporting people in a concrete task.<n>We evaluate two state-of-the-art generative AI systems against the radiologist's responses.
arXiv Detail & Related papers (2024-01-31T23:24:37Z) - Evaluation of AI Chatbots for Patient-Specific EHR Questions [5.195779994399724]
We use several large language model (LLM) based systems: ChatGPT (versions 3.5 and 4), Google Bard, and Claude.
We evaluate the accuracy, relevance, comprehensiveness, and coherence of the answers generated by each model using a 5-point Likert scale on a set of patient-specific questions.
arXiv Detail & Related papers (2023-06-05T02:52:54Z) - Consultation Checklists: Standardising the Human Evaluation of Medical
Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z) - Medical Question Understanding and Answering with Knowledge Grounding
and Semantic Self-Supervision [53.692793122749414]
We introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision.
Our system is a pipeline that first summarizes a long, medical, user-written question, using a supervised summarization loss.
The system first matches the summarized user question with an FAQ from a trusted medical knowledge base, and then retrieves a fixed number of relevant sentences from the corresponding answer document.
arXiv Detail & Related papers (2022-09-30T08:20:32Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Opportunities of a Machine Learning-based Decision Support System for
Stroke Rehabilitation Assessment [64.52563354823711]
Rehabilitation assessment is critical to determine an adequate intervention for a patient.
Current practices of assessment mainly rely on therapist's experience, and assessment is infrequently executed due to the limited availability of a therapist.
We developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning.
arXiv Detail & Related papers (2020-02-27T17:04:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.