Performance Assessment Strategies for Generative AI Applications in Healthcare
- URL: http://arxiv.org/abs/2509.08087v1
- Date: Tue, 09 Sep 2025 18:50:26 GMT
- Title: Performance Assessment Strategies for Generative AI Applications in Healthcare
- Authors: Victor Garcia, Mariia Sidulova, Aldo Badano
- Abstract summary: Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.
- Score: 1.0486921990935787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. Assessing GenAI applications necessitates a comprehensive understanding of the clinical task and awareness of the variability in performance when implemented in actual clinical environments. Presently, a prevalent method for evaluating the performance of generative models relies on quantitative benchmarks. Such benchmarks have limitations and may suffer from train-to-the-test overfitting, optimizing performance for a specified test set at the cost of generalizability across other tasks and data distributions. Evaluation strategies leveraging human expertise and utilizing cost-effective computational models as evaluators are gaining interest. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.
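The "cost-effective computational models as evaluators" strategy mentioned in the abstract is often implemented as an LLM-as-a-judge loop: an evaluator model scores each generated answer against a rubric, and human experts audit a sample of the scores. The sketch below is a minimal illustration of that pattern, not a method from the paper; `call_judge_model` is a hypothetical stand-in for whichever evaluator model or API a study actually uses.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative rubric for a clinical question-answering task.
RUBRIC = (
    "Score the answer from 1 (unsafe/incorrect) to 5 (clinically sound), "
    "judging factual accuracy, completeness, and potential for patient harm."
)

@dataclass
class EvalItem:
    question: str
    model_answer: str
    reference: str  # expert-written reference answer

def call_judge_model(prompt: str) -> int:
    """Hypothetical evaluator call -- replace with a real model or API.

    Expected to return an integer rubric score parsed from the judge's reply.
    """
    raise NotImplementedError("plug in an actual judge model here")

def judge_score(item: EvalItem) -> int:
    # Build a single-item judging prompt from the rubric and the answers.
    prompt = (
        f"{RUBRIC}\n\nQuestion: {item.question}\n"
        f"Reference answer: {item.reference}\n"
        f"Candidate answer: {item.model_answer}\n"
        "Respond with a single integer score."
    )
    return call_judge_model(prompt)

def evaluate(items: list[EvalItem]) -> float:
    """Mean rubric score over an evaluation set."""
    return mean(judge_score(it) for it in items)
```

Scoring against a held-out, periodically refreshed item set, with expert spot-checks of the judge's scores, is one way to mitigate the train-to-the-test overfitting the abstract warns about.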
Related papers
- Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition [6.191248426050678]
Therapeutic decision-making in clinical medicine requires robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG); a generic sketch of such a loop appears at the end of this list. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems.
arXiv Detail & Related papers (2025-12-12T16:01:48Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Evaluation Framework for AI Systems in "the Wild" [37.48117853114386]
Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, frequently failing to reflect real-world performance. This white paper proposes a comprehensive framework for how we should evaluate real-world GenAI systems.
arXiv Detail & Related papers (2025-04-23T14:52:39Z) - Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches [0.0]
Generative AI (GenAI) has revolutionized content generation, offering transformative capabilities for improving language coherence, readability, and overall quality. This manuscript explores the application of qualitative, quantitative, and mixed-methods research approaches to evaluate the performance of GenAI models in enhancing scientific writing.
arXiv Detail & Related papers (2024-11-26T23:34:07Z) - A Survey of Models for Cognitive Diagnosis: New Developments and Future Directions [66.40362209055023]
This paper aims to provide a survey of current models for cognitive diagnosis, with particular attention to new developments in machine learning-based methods.
By comparing the model structures, parameter estimation algorithms, model evaluation methods and applications, we provide a relatively comprehensive review of the recent trends in cognitive diagnosis models.
arXiv Detail & Related papers (2024-07-07T18:02:00Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory of human assessment originating in the 20th century, could be a powerful solution to the challenges in today's AI evaluations; a worked Rasch-model example appears at the end of this list.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models, owing to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z) - Explainable AI for clinical and remote health applications: a survey on tabular and time series data [3.655021726150368]
It is worth noting that XAI has not received the same level of attention across different research areas and data types, especially in healthcare.
This paper provides a review of the literature in the last 5 years, illustrating the type of generated explanations and the efforts provided to evaluate their relevance and quality.
arXiv Detail & Related papers (2022-09-14T10:01:29Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation, in which adversarially generated samples are used during the adaptation process.
Results confirm the effectiveness of our method and its generality across different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z) - Human Activity Recognition using Wearable Sensors: Review, Challenges, Evaluation Benchmark [0.0]
We conduct an extensive literature review on top-performing techniques in human activity recognition based on wearable sensors.
We apply a standardized evaluation benchmark to these state-of-the-art techniques using six publicly available datasets.
We also propose an improved experimental approach that combines enhanced handcrafted features with a neural network architecture.
arXiv Detail & Related papers (2021-01-05T17:33:04Z)
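The MedAI/TxAgent entry above describes iterative retrieval-augmented generation. The loop below is a generic illustration of that pattern, assumed for exposition and not TxAgent's actual implementation: the system alternates between retrieving evidence and asking the model whether it can answer or needs another search. Here `retrieve` and `generate` are hypothetical stand-ins for a biomedical knowledge-base query and an LLM call.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical stand-in for a biomedical knowledge-base search."""
    raise NotImplementedError("plug in a real retriever here")

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    raise NotImplementedError("plug in a real model here")

def iterative_rag(question: str, max_steps: int = 5) -> str:
    """Generic iterative RAG loop: gather evidence until the model
    declares an answer or the step budget runs out."""
    evidence: list[str] = []
    query = question
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        prompt = (
            f"Question: {question}\n"
            "Evidence so far:\n" + "\n".join(f"- {e}" for e in evidence) +
            "\nIf the evidence suffices, reply 'ANSWER: <final answer>'. "
            "Otherwise reply 'SEARCH: <next query>'."
        )
        reply = generate(prompt)
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        # Model asked for more evidence: use its proposed query next round.
        query = reply.removeprefix("SEARCH:").strip()
    # Step budget exhausted: force a final answer from what was gathered.
    return generate(f"Question: {question}\nAnswer using the evidence gathered.")
```

Capping the loop with `max_steps` keeps retrieval cost bounded when the model never declares an answer.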
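The psychometrics position paper above argues for borrowing human-testing theory for AI evaluation. One concrete tool from that tradition is item response theory; the sketch below is my illustration, not the paper's, and uses the one-parameter Rasch model, where the probability of a correct response is a logistic function of the gap between a system's latent ability and an item's difficulty.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical numbers: two AI systems of different latent ability,
# probed with one easy item and one hard item.
for ability in (0.5, 2.0):          # latent ability of two systems
    for difficulty in (-3.0, 1.5):  # an easy item and a hard item
        p = rasch_p_correct(ability, difficulty)
        print(f"ability={ability:+.1f} difficulty={difficulty:+.1f} -> P={p:.2f}")
```

Under this model, the hard item is what separates the two systems (P of 0.27 versus 0.62), while both score near-perfectly on the easy item; a benchmark built only from easy items would report near-identical accuracy for both.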