Benchmark datasets driving artificial intelligence development fail to
capture the needs of medical professionals
- URL: http://arxiv.org/abs/2201.07040v1
- Date: Tue, 18 Jan 2022 15:05:28 GMT
- Title: Benchmark datasets driving artificial intelligence development fail to
capture the needs of medical professionals
- Authors: Kathrin Blagec, Jakob Kraiger, Wolfgang Frühwirt, Matthias Samwald
- Abstract summary: We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP).
A total of 450 NLP datasets were manually systematized and annotated with rich metadata.
Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
- Score: 4.799783526620609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Publicly accessible benchmarks that allow for assessing and comparing model
performances are important drivers of progress in artificial intelligence (AI).
While recent advances in AI capabilities hold the potential to transform
medical practice by assisting and augmenting the cognitive processes of
healthcare professionals, the coverage of clinically relevant tasks by AI
benchmarks is largely unclear. Furthermore, there is a lack of systematized
meta-information that allows clinical AI researchers to quickly determine
accessibility, scope, content and other characteristics of datasets and
benchmark datasets relevant to the clinical domain.
To address these issues, we curated and released a comprehensive catalogue of
datasets and benchmarks pertaining to the broad domain of clinical and
biomedical natural language processing (NLP), based on a systematic review of
literature and online resources. A total of 450 NLP datasets were manually
systematized and annotated with rich metadata, such as targeted tasks, clinical
applicability, data types, performance metrics, accessibility and licensing
information, and availability of data splits. We then compared tasks covered by
AI benchmark datasets with relevant tasks that medical practitioners reported
as highly desirable targets for automation in a previous empirical study.
Our analysis indicates that AI benchmarks of direct clinical relevance are
scarce and fail to cover most work activities that clinicians want to see
addressed. In particular, tasks associated with routine documentation and
patient data administration workflows are not represented despite significant
associated workloads. Thus, currently available AI benchmarks are poorly
aligned with desired targets for AI automation in clinical settings, and novel
benchmarks should be created to fill these gaps.
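To make the catalogue's annotation scheme concrete, the sketch below models one entry with the metadata fields named in the abstract and checks which clinician-desired tasks lack benchmark coverage. It is a minimal illustration; the field names, example values, and the `uncovered_tasks` helper are hypothetical and do not reflect the released catalogue's actual schema.
```python
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    """One annotated dataset record; field names are hypothetical, mirroring the abstract."""
    name: str
    targeted_tasks: set[str]        # e.g. {"named entity recognition"}
    clinical_applicability: str     # e.g. "direct" vs. "indirect"
    data_types: list[str]           # e.g. ["clinical notes"]
    performance_metrics: list[str]  # e.g. ["F1"]
    accessibility: str              # e.g. "public" or "on request"
    license: str                    # e.g. "CC BY 4.0"
    has_data_splits: bool = False   # predefined train/dev/test splits available?

def uncovered_tasks(catalogue: list[CatalogueEntry],
                    desired: set[str]) -> set[str]:
    """Clinician-desired tasks not targeted by any catalogued benchmark."""
    covered: set[str] = set()
    for entry in catalogue:
        covered |= entry.targeted_tasks
    return desired - covered

# Toy run: documentation/administration tasks come back as coverage gaps.
catalogue = [
    CatalogueEntry(
        name="ExampleClinicalNER",
        targeted_tasks={"named entity recognition"},
        clinical_applicability="indirect",
        data_types=["clinical notes"],
        performance_metrics=["F1"],
        accessibility="on request",
        license="custom",
        has_data_splits=True,
    ),
]
desired = {"routine documentation", "patient data administration",
           "named entity recognition"}
print(uncovered_tasks(catalogue, desired))
# e.g. {'routine documentation', 'patient data administration'}
```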
Related papers
- Large Language Model Benchmarks in Medical Tasks [11.196196955468992]
This paper presents a survey of various benchmark datasets employed in medical large language models (LLMs) tasks.
The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs.
The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z)
- Speaking the Same Language: Leveraging LLMs in Standardizing Clinical Data for AI [0.0]
This study explores the adoption of large language models to address a specific challenge: the standardization of healthcare data.
Our results show that employing large language models substantially reduces the need for manual data curation.
The proposed methodology can expedite the integration of AI in healthcare and improve the quality of patient care, while minimizing the time and financial resources needed to prepare data for AI.
arXiv Detail & Related papers (2024-08-16T20:51:21Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple levels of perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format; a hypothetical sketch of one such record appears after this list.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design.
We provide basic validation methods for each task to ensure the datasets' usability and reliability.
We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z)
- ACR: A Benchmark for Automatic Cohort Retrieval [1.3547712404175771]
Current cohort retrieval methods rely on automated queries of structured data combined with manual curation.
Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems.
This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches.
arXiv Detail & Related papers (2024-06-20T23:04:06Z)
- Challenges for Responsible AI Design and Workflow Integration in Healthcare: A Case Study of Automatic Feeding Tube Qualification in Radiology [35.284458448940796]
Nasogastric tubes (NGTs) are feeding tubes that are inserted through the nose into the stomach to deliver nutrition or medication.
Recent AI developments demonstrate the feasibility of robustly detecting NGT placement from chest X-ray images.
We present a human-centered approach to the problem and describe insights derived following contextual inquiry and in-depth interviews with 15 clinical stakeholders.
arXiv Detail & Related papers (2024-05-08T14:16:22Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a doctor (as the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed to speed up patient recruitment by automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- Explainable AI for clinical and remote health applications: a survey on tabular and time series data [3.655021726150368]
Notably, XAI has not received the same attention across different research areas and data types, especially in healthcare.
This paper reviews the literature of the last 5 years, illustrating the types of explanations generated and the efforts made to evaluate their relevance and quality.
arXiv Detail & Related papers (2022-09-14T10:01:29Z)
- MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation [110.31526448744096]
We argue that unlocking the potential of medical AI requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data.
We are building MedPerf, an open framework for benchmarking machine learning in the medical domain.
arXiv Detail & Related papers (2021-09-29T18:09:41Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
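As referenced in the GMAI-MMBench entry above, the following sketch shows what a single VQA-format record spanning the described dimensions (image modality, clinical task, department, perceptual granularity) might look like. All field names and example values here are assumptions for illustration, not the benchmark's actual data format.
```python
from dataclasses import dataclass

@dataclass
class VQARecord:
    """Illustrative shape of one GMAI-MMBench-style VQA item; all fields are assumptions."""
    image_path: str              # image drawn from one of 38 medical modalities
    modality: str                # e.g. "chest X-ray"
    clinical_task: str           # one of 18 clinical-related tasks
    department: str              # one of 18 departments
    perceptual_granularity: str  # one of 4 levels, e.g. "image level"
    question: str                # the VQA question posed to the model
    options: list[str]           # multiple-choice answer options
    answer: str                  # key of the correct option

# Example record, purely hypothetical values:
record = VQARecord(
    image_path="images/cxr_0001.png",
    modality="chest X-ray",
    clinical_task="disease diagnosis",
    department="radiology",
    perceptual_granularity="image level",
    question="Which finding is most consistent with this image?",
    options=["A. Pneumothorax", "B. Pleural effusion", "C. No acute finding"],
    answer="B",
)
print(record.question, record.answer)
```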