Accelerating Clinical Evidence Synthesis with Large Language Models
- URL: http://arxiv.org/abs/2406.17755v1
- Date: Tue, 25 Jun 2024 17:41:52 GMT
- Title: Accelerating Clinical Evidence Synthesis with Large Language Models
- Authors: Zifeng Wang, Lang Cao, Benjamin Danek, Yichi Zhang, Qiao Jin, Zhiyong Lu, Jimeng Sun
- Abstract summary: We introduce TrialMind, a generative AI-based pipeline for conducting medical systematic reviews.
We utilize large language models (LLMs) to drive each pipeline component while incorporating human expert oversight to minimize errors.
Our results demonstrate that TrialMind significantly improves the literature review process.
- Score: 32.27109309792286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic medical discovery by AI is a dream of many. One step toward that goal is to create an AI model to understand clinical studies and synthesize clinical evidence from the literature. Clinical evidence synthesis currently relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating evidence. We introduce TrialMind, a generative AI-based pipeline for conducting medical systematic reviews, encompassing study search, screening, and data extraction phases. We utilize large language models (LLMs) to drive each pipeline component while incorporating human expert oversight to minimize errors. To facilitate evaluation, we also create a benchmark dataset TrialReviewBench, a custom dataset with 870 annotated clinical studies from 25 meta-analysis papers across various medical treatments. Our results demonstrate that TrialMind significantly improves the literature review process, achieving high recall rates (0.897-1.000) in study searching from over 20 million PubMed studies and outperforming traditional language model embeddings-based methods in screening (Recall@20 of 0.227-0.246 vs. 0.000-0.102). Furthermore, our approach surpasses direct GPT-4 performance in result extraction, with accuracy ranging from 0.65 to 0.84. We also support clinical evidence synthesis in forest plots, as validated by eight human annotators who preferred TrialMind over the GPT-4 baseline with a winning rate of 62.5%-100% across the involved reviews. Our findings suggest that an LLM-based clinical evidence synthesis approach, such as TrialMind, can enable reliable and high-quality clinical evidence synthesis to improve clinical research efficiency.
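The screening comparison above reports Recall@20, i.e., the fraction of a review's included studies that appear among a method's top 20 ranked candidates. A minimal sketch of how such a metric is computed (the function and the PubMed-style IDs below are illustrative, not TrialMind's actual code):

```python
# Illustrative sketch, not TrialMind's implementation: computing Recall@k
# for one systematic review, given a ranked candidate list and the set of
# studies the review actually included.

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant studies that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])  # candidates the screener would examine
    hits = len(top_k & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical example: 3 of the review's 10 included studies fall in the
# model's top-20 candidates, so Recall@20 = 3/10 = 0.3.
ranked = [f"pmid_{i}" for i in range(100)]
relevant = {f"pmid_{i}" for i in (2, 15, 19, *range(50, 57))}
print(recall_at_k(ranked, relevant, k=20))  # -> 0.3
```

In practice a paper would average this per-review score over all reviews in the benchmark; higher k trades reviewer screening effort for coverage.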
Related papers
- Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods [3.2666593942117688]
Octozi is an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. We demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48%. The system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden.
arXiv Detail & Related papers (2025-08-07T15:49:32Z) - An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [58.78045864541539]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM). DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning. The system demonstrates exceptional diagnostic performance across 2,919 diseases, achieving 100% accuracy for 1,013 diseases.
arXiv Detail & Related papers (2025-06-25T13:42:26Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - A foundation model for human-AI collaboration in medical literature mining [27.473923366948032]
We present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature.
The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries.
arXiv Detail & Related papers (2025-01-27T17:55:37Z) - End-To-End Clinical Trial Matching with Large Language Models [0.6151041580858937]
We present an end-to-end pipeline for clinical trial matching using Large Language Models (LLMs).
Our approach identifies relevant candidate trials in 93.3% of cases and achieves a preliminary accuracy of 88.0%.
Our fully end-to-end pipeline can operate autonomously or with human supervision and is not restricted to oncology.
arXiv Detail & Related papers (2024-07-18T12:36:26Z) - TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecules, disease codes, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design.
We provide basic validation methods for each task to ensure the datasets' usability and reliability.
We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z) - De-identification of clinical free text using natural language processing: A systematic review of current approaches [48.343430343213896]
Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years.
arXiv Detail & Related papers (2023-11-28T13:20:41Z) - Investigating Deep-Learning NLP for Automating the Extraction of Oncology Efficacy Endpoints from Scientific Literature [0.0]
We have developed and optimised a framework to extract efficacy endpoints from text in scientific papers.
Our machine learning model predicts 25 classes associated with efficacy endpoints and leads to high F1 scores.
arXiv Detail & Related papers (2023-11-03T14:01:54Z) - Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding [31.884600238089405]
We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research.
Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs.
arXiv Detail & Related papers (2023-05-19T23:07:09Z) - SPOT: Sequential Predictive Modeling of Clinical Trial Outcome with Meta-Learning [67.8195828626489]
Clinical trials are essential to drug development but time-consuming, costly, and prone to failure.
We propose Sequential Predictive mOdeling of clinical Trial outcome (SPOT) that first identifies trial topics to cluster the multi-sourced trial data into relevant trial topics.
With the consideration of each trial sequence as a task, it uses a meta-learning strategy to achieve a point where the model can rapidly adapt to new tasks with minimal updates.
arXiv Detail & Related papers (2023-04-07T23:04:27Z) - Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z) - ANOVA-based Automatic Attribute Selection and a Predictive Model for Heart Disease Prognosis [8.258177970935085]
This work proposes an information fusion technique that combines key attributes of a person through analysis of variance (ANOVA) and domain experts' knowledge.
The proposed approach can achieve a competitive mean average accuracy (mAA) of 99.2% and a mean average AUC of 97.9%.
arXiv Detail & Related papers (2022-07-30T19:29:18Z) - A Systematic Review of Natural Language Processing Applied to Radiology Reports [3.600747505433814]
This study systematically assesses recent literature in NLP applied to radiology reports.
Our analysis is based on 21 variables, including radiology characteristics, NLP methodology, performance, and study and clinical application characteristics.
arXiv Detail & Related papers (2021-02-18T18:54:41Z) - HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data [56.53715632642495]
Clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment.
In this paper, we propose the Hierarchical INteraction Network (HINT) for more general clinical trial outcome prediction.
arXiv Detail & Related papers (2021-02-08T15:09:07Z) - CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks [51.589769497681175]
The novel coronavirus (SARS-CoV-2) has led to a pandemic.
The current testing regime based on Reverse Transcription-Polymerase Chain Reaction for SARS-CoV-2 has been unable to keep up with testing demands.
We propose a framework called CovidDeep that combines efficient deep neural networks (DNNs) with commercially available wearable medical sensors (WMSs) for pervasive testing of the virus.
arXiv Detail & Related papers (2020-07-20T21:47:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.