The METRIC-framework for assessing data quality for trustworthy AI in
medicine: a systematic review
- URL: http://arxiv.org/abs/2402.13635v1
- Date: Wed, 21 Feb 2024 09:15:46 GMT
- Title: The METRIC-framework for assessing data quality for trustworthy AI in
medicine: a systematic review
- Authors: Daniel Schwabe, Katinka Becker, Martin Seyferth, Andreas Klaß,
Tobias Schäffter
- Abstract summary: Development of trustworthy AI is especially important in medicine.
We focus on the importance of data quality (training/test) in deep learning (DL).
We propose the METRIC-framework, a specialised data quality framework for medical training data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The adoption of machine learning (ML) and, more specifically, deep learning
(DL) applications into all major areas of our lives is underway. The
development of trustworthy AI is especially important in medicine due to the
large implications for patients' lives. While trustworthiness concerns various
aspects including ethical, technical and privacy requirements, we focus on the
importance of data quality (training/test) in DL. Since data quality dictates
the behaviour of ML products, evaluating data quality will play a key part in
the regulatory approval of medical AI products. We perform a systematic review
following PRISMA guidelines using the databases PubMed and ACM Digital Library.
We identify 2362 studies, out of which 62 records fulfil our eligibility
criteria. From this literature, we synthesise the existing knowledge on data
quality frameworks and combine it with the perspective of ML applications in
medicine. As a result, we propose the METRIC-framework, a specialised data
quality framework for medical training data comprising 15 awareness dimensions,
along which developers of medical ML applications should investigate a dataset.
This knowledge helps to reduce biases as a major source of unfairness, increase
robustness, facilitate interpretability and thus lays the foundation for
trustworthy AI in medicine. Incorporating such systematic assessment of medical
datasets into regulatory approval processes has the potential to accelerate the
approval of ML products and builds the basis for new standards.
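The abstract names 15 awareness dimensions but does not enumerate them. As an illustration only, a dataset checklist along such dimensions could be encoded like the Python sketch below; the dimension names used here (completeness, consistency, representativeness, label quality, provenance) are placeholders, not the framework's actual list.

```python
from dataclasses import dataclass, field

# Placeholder dimension names for illustration; the METRIC-framework defines
# its own 15 awareness dimensions, which the abstract does not enumerate.
DIMENSIONS = [
    "completeness", "consistency", "representativeness",
    "label_quality", "provenance",
]

@dataclass
class DimensionAssessment:
    dimension: str
    score: float   # assumed 0.0 (poor) .. 1.0 (good) scale
    notes: str = ""

@dataclass
class DatasetQualityReport:
    dataset_name: str
    assessments: list = field(default_factory=list)

    def assess(self, dimension: str, score: float, notes: str = "") -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        self.assessments.append(DimensionAssessment(dimension, score, notes))

    def uncovered(self) -> list:
        # Dimensions the developers have not yet investigated.
        covered = {a.dimension for a in self.assessments}
        return [d for d in DIMENSIONS if d not in covered]

report = DatasetQualityReport("chest-xray-v1")  # hypothetical dataset name
report.assess("completeness", 0.8, "3% of studies missing lateral view")
print(report.uncovered())  # dimensions still to be checked before training
```

The point of such a structure is the `uncovered()` check: a dataset is not signed off until every awareness dimension has been explicitly investigated.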
Related papers
- AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels [19.90354530235266]
We introduce a novel approach called Self-Learning Hypothetical Document Embeddings (SL-HyDE) to tackle this issue.
SL-HyDE leverages large language models (LLMs) as generators to generate hypothetical documents based on a given query.
We present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios.
arXiv Detail & Related papers (2024-10-26T02:53:20Z)
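SL-HyDE builds on the hypothetical-document-embeddings idea: the LLM first drafts a document that would answer the query, and dense retrieval is run against that draft's embedding rather than the raw query's. A minimal sketch of that retrieval step, with the LLM call abstracted as a caller-supplied `generate` function and a generic sentence-transformers encoder standing in for the paper's actual components:

```python
import numpy as np
from typing import Callable
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense encoder

def hyde_retrieve(
    query: str,
    corpus: list,
    generate: Callable[[str], str],  # any LLM client wrapped as str -> str
    k: int = 5,
):
    # 1. LLM drafts a hypothetical document that would answer the query.
    hypo_doc = generate(f"Write a short medical passage answering: {query}")
    # 2. Embed the draft instead of the raw query.
    q_vec = encoder.encode([hypo_doc])[0]
    # 3. Rank real documents by cosine similarity to the draft's embedding.
    d_vecs = encoder.encode(corpus)
    sims = d_vecs @ q_vec / (
        np.linalg.norm(d_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    return [corpus[i] for i in np.argsort(-sims)[:k]]

# Usage: hyde_retrieve("first-line treatment for AMD?", docs, generate=my_llm)
```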
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecules, disease codes, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design.
We provide basic validation methods for each task to ensure the datasets' usability and reliability.
We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z)
- Scorecards for Synthetic Medical Data Evaluation and Reporting [2.8262986891348056]
The growing utilization of synthetic medical data (SMD) in training and testing AI-driven tools in healthcare requires a systematic framework for assessing its quality.
Here, we outline an evaluation framework designed to meet the unique requirements of medical applications.
We introduce the concept of scorecards, which can serve as comprehensive reports that accompany artificially generated datasets.
arXiv Detail & Related papers (2024-06-17T02:11:59Z)
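The scorecard concept suggests a structured report shipped alongside each generated dataset. The paper's schema is not given in the abstract; every field name and value below is an illustrative placeholder, not the proposed scorecard format:

```python
import json
from datetime import date

# Illustrative scorecard for a synthetic medical dataset; all fields and
# numbers are placeholder assumptions, not the schema proposed in the paper.
scorecard = {
    "dataset": "synthetic-ct-liver-v2",          # hypothetical name
    "generator": {"model": "diffusion", "version": "1.3"},
    "generated_on": date.today().isoformat(),
    "fidelity": {           # how closely SMD matches real-data statistics
        "fid": 12.4,
        "feature_ks_pvalue": 0.31,
    },
    "utility": {            # downstream task performance, SMD vs real
        "auc_trained_on_smd": 0.87,
        "auc_trained_on_real": 0.91,
    },
    "privacy": {            # leakage checks against the source data
        "nearest_neighbour_distance_ratio": 1.08,
        "membership_inference_auc": 0.52,
    },
    "intended_use": "augmentation only; not for standalone validation",
}
print(json.dumps(scorecard, indent=2))
```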
- A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry [2.1717945745027425]
Large Language Models (LLMs) have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation.
This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare.
Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness.
arXiv Detail & Related papers (2024-04-24T09:55:24Z)
- Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes [0.0]
We propose an end-to-end machine learning solution based on large language models (LLMs).
The entities used in the KG construction process are diseases, factors, treatments, and manifestations that the patient experiences alongside the disease.
The application of the proposed methodology is demonstrated on age-related macular degeneration.
arXiv Detail & Related papers (2023-01-29T15:52:33Z)
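One common way to drive this kind of EMR-note extraction is to prompt an LLM for JSON-formatted entities and triples. The prompt, output schema, and `generate` hook below are assumptions for illustration, not the paper's actual pipeline:

```python
import json
from typing import Callable

ENTITY_TYPES = ["disease", "factor", "treatment", "manifestation"]

def build_prompt(note: str) -> str:
    # Hypothetical prompt; the paper's actual prompting strategy is not given.
    return (
        "Extract entities and relations from the EMR note below.\n"
        'Return JSON of the form {"entities": [{"text": "...", "type": "..."}], '
        '"triples": [["head", "relation", "tail"]]}.\n'
        f"Allowed entity types: {ENTITY_TYPES}\n"
        f"Note: {note}"
    )

def extract_kg_fragment(note: str, generate: Callable[[str], str]) -> dict:
    # One LLM call per note; a real pipeline would add validation and retries.
    fragment = json.loads(generate(build_prompt(note)))
    # Drop entities whose type is outside the assumed schema.
    fragment["entities"] = [
        e for e in fragment["entities"] if e["type"] in ENTITY_TYPES
    ]
    return fragment

# Usage: extract_kg_fragment("72F with AMD, smoker, on anti-VEGF ...", my_llm)
```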
- Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals [4.799783526620609]
We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP).
A total of 450 NLP datasets were manually systematized and annotated with rich metadata.
Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
arXiv Detail & Related papers (2022-01-18T15:05:28Z)
- MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation [110.31526448744096]
We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data.
We are building MedPerf, an open framework for benchmarking machine learning in the medical domain.
arXiv Detail & Related papers (2021-09-29T18:09:41Z)
- The Medkit-Learn(ing) Environment: Medical Decision Modelling through Simulation [81.72197368690031]
We present a new benchmarking suite designed specifically for medical sequential decision making.
The Medkit-Learn(ing) Environment is a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data.
arXiv Detail & Related papers (2021-06-08T10:38:09Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
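PriMIA's summary mentions a securely aggregated federated learning model. The sketch below shows only the plain federated-averaging step that such systems build on; it aggregates in the clear and uses no PriMIA API (generic PyTorch, for illustration only):

```python
import torch

def federated_average(state_dicts, weights):
    """Weighted average of client model parameters (one plain FedAvg step).

    PriMIA-style secure aggregation would compute this sum over masked or
    encrypted updates; this sketch averages in the clear for illustration.
    """
    total = float(sum(weights))
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts)) / total
        for key in state_dicts[0]
    }

# Usage with two clients weighted by their local dataset sizes n1, n2:
# global_state = federated_average([m1.state_dict(), m2.state_dict()], [n1, n2])
```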
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.