Related papers: CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

URL: http://arxiv.org/abs/2406.17888v1
Date: Tue, 25 Jun 2024 18:52:48 GMT
Title: CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
Authors: Nafis Neehal, Bowen Wang, Shayom Debopadhaya, Soham Dan, Keerthiram Murugesan, Vibha Anand, Kristin P. Bennett,
Abstract summary: CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. It consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications.
Score: 15.2100541345819
License: http://creativecommons.org/licenses/by/4.0/
Abstract: CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.

Related papers

AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents [47.640779069547534]
AutoCT is a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning.<n>We show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations.
arXiv Detail & Related papers (2025-06-04T11:50:55Z)
AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology [47.52685298426068]
We systematically evaluate the reasoning capabilities of large language models (LLMs) in anesthesiology.<n>AnesBench is a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels.
arXiv Detail & Related papers (2025-04-03T08:54:23Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy [63.39037092484374]
This study focuses on the clinical evaluation of medical Synthetic Data Generation using Artificial Intelligence (AI) models. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools.
arXiv Detail & Related papers (2024-10-31T19:48:50Z)
Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model [0.7373617024876725]
Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants. The complex nature of unstructured medical texts presents challenges in efficiently identifying participants. In this study, we aimed to evaluate the performance of a prompt-based large language model for the cohort selection task.
arXiv Detail & Related papers (2024-04-24T20:42:28Z)
A Dataset and Benchmark for Hospital Course Summarization with Adapted Large Language Models [4.091402760759184]
Large language models (LLMs) depict remarkable capabilities in automating real-world tasks, but their capabilities for healthcare applications have not been shown. We introduce a novel pre-processed dataset, the MIMIC-IV-BHC, encapsulating clinical note and brief hospital course (BHC) pairs to adapt LLMs for BHC. Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to three open-source LLMs and two proprietary LLMs.
arXiv Detail & Related papers (2024-03-08T23:17:55Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Towards Unifying Anatomy Segmentation: Automated Generation of a Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines [113.08940153125616]
We generate a dataset of whole-body CT scans with $142$ voxel-level labels for 533 volumes providing comprehensive anatomical coverage. Our proposed procedure does not rely on manual annotation during the label aggregation stage. We release our trained unified anatomical segmentation model capable of predicting $142$ anatomical structures on CT data.
arXiv Detail & Related papers (2023-07-25T09:48:13Z)
TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment. In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials. We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking [8.200196331837576]
Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm.
arXiv Detail & Related papers (2023-07-01T16:42:39Z)
AutoTrial: Prompting Language Models for Clinical Trial Design [53.630479619856516]
We present a method named AutoTrial to aid the design of clinical eligibility criteria using language models. Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts.
arXiv Detail & Related papers (2023-05-19T01:04:16Z)
Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering [20.534197056683695]
This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks. We developed a task-specific prompt framework that includes baseline prompts, annotation guideline-based prompts, error analysis-based instructions, and annotated samples. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
arXiv Detail & Related papers (2023-03-29T02:46:18Z)
Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain. We annotated a corpus of clinical documents according to 12 types of identifying entities. We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
Clinical Trial Information Extraction with BERT [0.0]
We propose a framework called CT-BERT for information extraction from clinical trial text. We trained named entity recognition (NER) models to extract eligibility criteria entities. The results demonstrate the superiority of CT-BERT in clinical trial NLP.
arXiv Detail & Related papers (2021-09-11T17:15:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.