Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark
- URL: http://arxiv.org/abs/2406.10292v2
- Date: Mon, 03 Feb 2025 20:16:56 GMT
- Title: Automatically Labeling $200B Life-Saving Datasets: A Large Clinical Trial Outcome Benchmark
- Authors: Chufan Gao, Jathurshan Pradeepkumar, Trisha Das, Shivashankar Thati, Jimeng Sun,
- Abstract summary: Clinical trial outcomes play a critical role in the regulatory approval of new drugs and impacting patient outcomes.
Despite their significance, large-scale, high-quality clinical trial outcome data are not readily available to the public.
We introduce the Clinical Trial Outcome knowledge base, a fully reproducible, large-scale (around 125K drug and biologics trials)
We additionally performed manual annotation of a set of recent clinical trials from 2020-2024.
- Score: 24.663798850232588
- License:
- Abstract: Background: The global cost of drug discovery and development exceeds $200 billion annually, with clinical trial outcomes playing a critical role in the regulatory approval of new drugs and impacting patient outcomes. Despite their significance, large-scale, high-quality clinical trial outcome data are not readily available to the public, limiting advances in trial outcome predictive modeling. Methods: We introduce the Clinical Trial Outcome (CTO) knowledge base, a fully reproducible, large-scale (around 125K drug and biologics trials), open-source of clinical trial information including large language model (LLM) interpretations of publications, matched trials over phases, sentiment analysis from news, stock prices of trial sponsors, and other trial-related metrics. From this knowledge base, we additionally performed manual annotation of a set of recent clinical trials from 2020-2024. Results: We evaluated the quality of our knowledge base by generating high-quality trial outcome labels that demonstrate strong agreement with previously published expert annotations, achieving an F1 score of 94 for Phase 3 trials and 91 across all phases. Additionally, we benchmarked a suite of standard machine learning models on our manually annotated set, highlighting the distribution shift of recent trials and the need for continuously updated labeling methods. Conclusions: By analyzing CTO's performance on recent trials, we showed a need for recent, high-quality trial outcome labels. We release our knowledge base and labels to the public at https://chufangao.github.io/CTOD, which will also be regularly updated to support ongoing research in clinical trial outcomes, offering insights that could optimize the drug development process.
Related papers
- TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AIready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design.
We provide basic validation methods for each task to ensure the datasets' usability and reliability.
We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z) - TrialDura: Hierarchical Attention Transformer for Interpretable Clinical Trial Duration Prediction [19.084936647082632]
We propose TrialDura, a machine learning-based method that estimates the duration of clinical trials using multimodal data.
We encode them into Bio-BERT embeddings specifically tuned for biomedical contexts to provide a deeper and more relevant semantic understanding.
Our proposed model demonstrated superior performance with a mean absolute error (MAE) of 1.04 years and a root mean square error (RMSE) of 1.39 years compared to the other models.
arXiv Detail & Related papers (2024-04-20T02:12:59Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - AutoTrial: Prompting Language Models for Clinical Trial Design [53.630479619856516]
We present a method named AutoTrial to aid the design of clinical eligibility criteria using language models.
Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts.
arXiv Detail & Related papers (2023-05-19T01:04:16Z) - SPOT: Sequential Predictive Modeling of Clinical Trial Outcome with
Meta-Learning [67.8195828626489]
Clinical trials are essential to drug development but time-consuming, costly, and prone to failure.
We propose Sequential Predictive mOdeling of clinical Trial outcome (SPOT) that first identifies trial topics to cluster the multi-sourced trial data into relevant trial topics.
With the consideration of each trial sequence as a task, it uses a meta-learning strategy to achieve a point where the model can rapidly adapt to new tasks with minimal updates.
arXiv Detail & Related papers (2023-04-07T23:04:27Z) - Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using
Self-Supervision [42.859662256134584]
We propose Trial2Vec, which learns through self-supervision without annotating similar clinical trials.
meta-structure of trial documents (e.g., title, eligibility criteria, target disease) along with clinical knowledge are leveraged to automatically generate contrastive samples.
We show that our method yields medically interpretable embeddings by visualization and it gets a 15% average improvement over the best baselines on precision/recall for trial retrieval.
arXiv Detail & Related papers (2022-06-29T15:37:11Z) - Clinical trial site matching with improved diversity using fair policy
learning [56.01170456417214]
We learn a model that maps a clinical trial description to a ranked list of potential trial sites.
Unlike existing fairness frameworks, the group membership of each trial site is non-binary.
We propose fairness criteria based on demographic parity to address such a multi-group membership scenario.
arXiv Detail & Related papers (2022-04-13T16:35:28Z) - Validating GAN-BioBERT: A Methodology For Assessing Reporting Trends In
Clinical Trials [3.164363223464948]
This study develops a sentiment classification algorithm for clinical trial abstracts using a semi-supervised natural language process model.
The use of this algorithm was found to have a classification accuracy of 91.3%, with a macro F1-Score of 0.92, which is a significant improvement in accuracy when compared to previous methods.
arXiv Detail & Related papers (2021-06-01T17:51:54Z) - HINT: Hierarchical Interaction Network for Trial Outcome Prediction
Leveraging Web Data [56.53715632642495]
Clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment.
In this paper, we propose Hierarchical INteraction Network (HINT) for more general, clinical trial outcome predictions.
arXiv Detail & Related papers (2021-02-08T15:09:07Z) - Predicting Clinical Trial Results by Implicit Evidence Integration [40.80948875051806]
We introduce a novel Clinical Trial Result Prediction (CTRP) task.
In the CTRP framework, a model takes a PICO-formatted clinical trial proposal with its background as input and predicts the result.
We exploit large-scale unstructured sentences from medical literature that implicitly contain PICOs and results as evidence.
arXiv Detail & Related papers (2020-10-12T12:25:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.