Performance Assessment Strategies for Generative AI Applications in Healthcare
- URL: http://arxiv.org/abs/2509.08087v1
- Date: Tue, 09 Sep 2025 18:50:26 GMT
- Title: Performance Assessment Strategies for Generative AI Applications in Healthcare
- Authors: Victor Garcia, Mariia Sidulova, Aldo Badano
- Abstract summary: Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.
- Score: 1.0486921990935787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative artificial intelligence (GenAI) represents an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. Assessing GenAI applications necessitates a comprehensive understanding of the clinical task and awareness of the variability in performance when implemented in actual clinical environments. Presently, a prevalent method for evaluating the performance of generative models relies on quantitative benchmarks. Such benchmarks have limitations and may suffer from train-to-the-test overfitting, optimizing performance for a specified test set at the cost of generalizability across other tasks and data distributions. Evaluation strategies leveraging human expertise and utilizing cost-effective computational models as evaluators are gaining interest. We discuss current state-of-the-art methodologies for assessing the performance of GenAI applications in healthcare and medical devices.
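The "cost-effective computational models as evaluators" strategy mentioned in the abstract is often implemented as an LLM-as-a-judge loop: an evaluator model scores each generated answer against a rubric, and human experts audit a sample of the scores. The sketch below is a minimal illustration of that pattern, not a method from the paper; `call_judge_model` is a hypothetical stand-in for whichever evaluator model or API a study actually uses.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative rubric for a clinical question-answering task.
RUBRIC = (
    "Score the answer from 1 (unsafe/incorrect) to 5 (clinically sound), "
    "judging factual accuracy, completeness, and potential for patient harm."
)

@dataclass
class EvalItem:
    question: str
    model_answer: str
    reference: str  # expert-written reference answer

def call_judge_model(prompt: str) -> int:
    """Hypothetical evaluator call -- replace with a real model or API.

    Expected to return an integer rubric score parsed from the judge's reply.
    """
    raise NotImplementedError("plug in an actual judge model here")

def judge_score(item: EvalItem) -> int:
    # Build a single-item judging prompt from the rubric and the answers.
    prompt = (
        f"{RUBRIC}\n\nQuestion: {item.question}\n"
        f"Reference answer: {item.reference}\n"
        f"Candidate answer: {item.model_answer}\n"
        "Respond with a single integer score."
    )
    return call_judge_model(prompt)

def evaluate(items: list[EvalItem]) -> float:
    """Mean rubric score over an evaluation set."""
    return mean(judge_score(it) for it in items)
```

Scoring against a held-out, periodically refreshed item set, with expert spot-checks of the judge's scores, is one way to mitigate the train-to-the-test overfitting the abstract warns about.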
Related papers
- Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition [6.191248426050678]
Therapeutic decision-making in clinical medicine requires robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG); a generic sketch of such a loop appears at the end of this list. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems.
arXiv Detail & Related papers (2025-12-12T16:01:48Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Evaluation Framework for AI Systems in "the Wild" [37.48117853114386]
Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, frequently failing to reflect real-world performance. This white paper proposes a comprehensive framework for how we should evaluate real-world GenAI systems.
arXiv Detail & Related papers (2025-04-23T14:52:39Z) - Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches [0.0]
Generative AI (GenAI) has revolutionized content generation, offering transformative capabilities for improving language coherence, readability, and overall quality. This manuscript explores the application of qualitative, quantitative, and mixed-methods research approaches to evaluate the performance of GenAI models in enhancing scientific writing.
arXiv Detail & Related papers (2024-11-26T23:34:07Z) - A Survey of Models for Cognitive Diagnosis: New Developments and Future Directions [66.40362209055023]
This paper aims to provide a survey of current models for cognitive diagnosis, with particular attention to new developments in machine learning-based methods.
By comparing the model structures, parameter estimation algorithms, model evaluation methods and applications, we provide a relatively comprehensive review of the recent trends in cognitive diagnosis models.
arXiv Detail & Related papers (2024-07-07T18:02:00Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory of human assessment originating in the 20th century, could be a powerful solution to the challenges in today's AI evaluations; a worked Rasch-model example appears at the end of this list.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models, owing to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z) - Explainable AI for clinical and remote health applications: a survey on tabular and time series data [3.655021726150368]
It is worth noting that XAI has not received the same level of attention across different research areas and data types, especially in healthcare.
This paper provides a review of the literature in the last 5 years, illustrating the type of generated explanations and the efforts provided to evaluate their relevance and quality.
arXiv Detail & Related papers (2022-09-14T10:01:29Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation, in which adversarially generated samples are used during the adaptation process.
Results confirm the effectiveness of our method and its generality across different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z) - Human Activity Recognition using Wearable Sensors: Review, Challenges, Evaluation Benchmark [0.0]
We conduct an extensive literature review on top-performing techniques in human activity recognition based on wearable sensors.
We apply a standardized evaluation benchmark to these state-of-the-art techniques using six publicly available datasets.
We also propose an improved experimental approach that combines enhanced handcrafted features with a neural network architecture.
arXiv Detail & Related papers (2021-01-05T17:33:04Z)
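The MedAI/TxAgent entry above describes iterative retrieval-augmented generation. The loop below is a generic illustration of that pattern, assumed for exposition and not TxAgent's actual implementation: the system alternates between retrieving evidence and asking the model whether it can answer or needs another search. Here `retrieve` and `generate` are hypothetical stand-ins for a biomedical knowledge-base query and an LLM call.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical stand-in for a biomedical knowledge-base search."""
    raise NotImplementedError("plug in a real retriever here")

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    raise NotImplementedError("plug in a real model here")

def iterative_rag(question: str, max_steps: int = 5) -> str:
    """Generic iterative RAG loop: gather evidence until the model
    declares an answer or the step budget runs out."""
    evidence: list[str] = []
    query = question
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        prompt = (
            f"Question: {question}\n"
            "Evidence so far:\n" + "\n".join(f"- {e}" for e in evidence) +
            "\nIf the evidence suffices, reply 'ANSWER: <final answer>'. "
            "Otherwise reply 'SEARCH: <next query>'."
        )
        reply = generate(prompt)
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        # Model asked for more evidence: use its proposed query next round.
        query = reply.removeprefix("SEARCH:").strip()
    # Step budget exhausted: force a final answer from what was gathered.
    return generate(f"Question: {question}\nAnswer using the evidence gathered.")
```

Capping the loop with `max_steps` keeps retrieval cost bounded when the model never declares an answer.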
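The psychometrics position paper above argues for borrowing human-testing theory for AI evaluation. One concrete tool from that tradition is item response theory; the sketch below is my illustration, not the paper's, and uses the one-parameter Rasch model, where the probability of a correct response is a logistic function of the gap between a system's latent ability and an item's difficulty.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical numbers: two AI systems of different latent ability,
# probed with one easy item and one hard item.
for ability in (0.5, 2.0):          # latent ability of two systems
    for difficulty in (-3.0, 1.5):  # an easy item and a hard item
        p = rasch_p_correct(ability, difficulty)
        print(f"ability={ability:+.1f} difficulty={difficulty:+.1f} -> P={p:.2f}")
```

Under this model, the hard item is what separates the two systems (P of 0.27 versus 0.62), while both score near-perfectly on the easy item; a benchmark built only from easy items would report near-identical accuracy for both.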