Related papers: Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

URL: http://arxiv.org/abs/2203.11015v1
Date: Wed, 9 Mar 2022 23:53:07 GMT
Title: Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning
Authors: Xianghao Zhan, Fanjin Wang, Olivier Gevaert
Abstract summary: Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage liver. Life-threatening results including liver failure or death were also reported in severe DILI cases. Data extraction from previous publications relies heavily on manual labelling. Recent development of artificial intelligence enabled automatic processing of biomedical texts.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage liver. Life-threatening results including liver failure or death were also reported in severe DILI cases. Therefore, DILI-related events are strictly monitored for all approved drugs and the liver toxicity became important assessments for new drug candidates. These DILI-related reports are documented in hospital records, in clinical trial results, and also in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, data extraction from previous publications relies heavily on resource-demanding manual labelling, which considerably decreased the efficiency of the information extraction process. The recent development of artificial intelligence, particularly, the rise of natural language processing (NLP) techniques, enabled the automatic processing of biomedical texts. In this study, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked model performances on filtering out DILI literature. Among four word vectorization techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed others with an accuracy of 0.957 with our in-house test set. Furthermore, an ensemble model with similar overall performances was implemented and was fine-tuned to lower the false-negative cases to avoid neglecting potential DILI reports. The ensemble model achieved a high accuracy of 0.954 and an F1 score of 0.955 in the hold-out validation data provided by the CAMDA committee. Moreover, important words in positive/negative predictions were identified via model interpretation. Overall, the ensemble model reached satisfactory classification results, which can be further used by researchers to rapidly filter DILI-related literature.

Related papers

StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection [0.0]
StackLiverNet is an interpretable stacked ensemble model tailored to the liver disease detection task.<n>The framework uses advanced data preprocessing and feature selection technique to increase model robustness and predictive ability.<n>The provided model demonstrates excellent performance, with the testing accuracy of 99.89%, Cohen Kappa of 0.9974, and AUC of 0.9993, having only 5 misclassifications.
arXiv Detail & Related papers (2025-07-31T19:13:30Z)
Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents [64.43980129731587]
We propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC) CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness.
arXiv Detail & Related papers (2025-03-11T17:59:00Z)
EVolutionary Independent DEtermiNistiC Explanation [5.127310126394387]
This paper introduces the Evolutionary Independent Deterministic Explanation (EVIDENCE) theory. EVIDENCE offers a deterministic, model-independent method for extracting significant signals from black-box models. Practical applications of EVIDENCE include improving diagnostic accuracy in healthcare and enhancing audio signal analysis.
arXiv Detail & Related papers (2025-01-20T12:05:14Z)
CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE) CRTRE applies randomize trial design principles to estimate the causal effect of association rules. We then incorporate such association rules for the downstream applications such as prediction of disease onsets.
arXiv Detail & Related papers (2024-11-10T02:40:06Z)
Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality [0.0]
This research aims to develop an interpretable and accurate ML model to help clinical professionals predict in-hospital mortality. We analyzed ICU patient records from the MIMIC-III database based on specific criteria and extracted relevant data. The Random Forest model was the most effective in predicting sepsis-related in-hospital mortality.
arXiv Detail & Related papers (2024-08-03T00:28:25Z)
Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options. The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z)
Detecting the Clinical Features of Difficult-to-Treat Depression using Synthetic Data from Large Language Models [0.20971479389679337]
We seek to develop a Large Language Model (LLM)-based tool capable of interrogating routinely-collected, narrative (free-text) electronic health record data. We use LLM-generated synthetic data (GPT3.5) and a Non-Maximum Suppression (NMS) algorithm to train a BERT-based span extraction model. We show it is possible to obtain good overall performance (0.70 F1 across polarity) on real clinical data on a set of as many as 20 different factors, and high performance (0.85 F1 with 0.95 precision) on a subset of important DTD
arXiv Detail & Related papers (2024-02-12T13:34:33Z)
MedDistant19: A Challenging Benchmark for Distantly Supervised Biomedical Relation Extraction [19.046156065686308]
Distant supervision is commonly used to tackle the scarcity of annotated data. Bio-DSRE models can seemingly produce very accurate results in several benchmarks. However, given the challenging nature of the task, we set out to investigate the validity of such impressive results.
arXiv Detail & Related papers (2022-04-10T22:07:25Z)
Assessment of contextualised representations in detecting outcome phrases in clinical trials [14.584741378279316]
We introduce "EBM-COMET", a dataset in which 300 PubMed abstracts are expertly annotated for clinical outcomes. To extract outcomes, we fine-tune a variety of pre-trained contextualized representations. We observe our best model (BioBERT) achieve 81.5% F1, 81.3% sensitivity and 98.0% specificity.
arXiv Detail & Related papers (2022-02-13T15:08:00Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data [56.53715632642495]
Clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment. In this paper, we propose Hierarchical INteraction Network (HINT) for more general, clinical trial outcome predictions.
arXiv Detail & Related papers (2021-02-08T15:09:07Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD) UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to $19%$ over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
Predicting Clinical Trial Results by Implicit Evidence Integration [40.80948875051806]
We introduce a novel Clinical Trial Result Prediction (CTRP) task. In the CTRP framework, a model takes a PICO-formatted clinical trial proposal with its background as input and predicts the result. We exploit large-scale unstructured sentences from medical literature that implicitly contain PICOs and results as evidence.
arXiv Detail & Related papers (2020-10-12T12:25:41Z)
Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations [33.30381080306156]
Medical experts must manually extract information from articles to inform decision-making. We consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and (b) inferring the reported results for the former with respect to the latter.
arXiv Detail & Related papers (2020-10-07T17:50:58Z)
Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in medical community. We present a modification of Bidirectional Representations from Transformers (BERT) model for classification sequence. We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.