Filter Drug-induced Liver Injury Literature with Natural Language
Processing and Ensemble Learning
- URL: http://arxiv.org/abs/2203.11015v1
- Date: Wed, 9 Mar 2022 23:53:07 GMT
- Title: Filter Drug-induced Liver Injury Literature with Natural Language
Processing and Ensemble Learning
- Authors: Xianghao Zhan, Fanjin Wang, Olivier Gevaert
- Abstract summary: Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver.
Life-threatening outcomes, including liver failure or death, have also been reported in severe DILI cases.
Data extraction from previous publications relies heavily on manual labelling.
Recent developments in artificial intelligence have enabled automatic processing of biomedical texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Drug-induced liver injury (DILI) describes the adverse effects of drugs that
damage the liver. Life-threatening outcomes, including liver failure or death,
have also been reported in severe DILI cases. Therefore, DILI-related events are
strictly monitored for all approved drugs, and liver toxicity has become an
important assessment for new drug candidates. These DILI-related reports are documented
in hospital records, in clinical trial results, and also in research papers
that contain preliminary in vitro and in vivo experiments. Conventionally, data
extraction from previous publications relies heavily on resource-demanding
manual labelling, which considerably decreases the efficiency of the
information extraction process. Recent developments in artificial
intelligence, particularly the rise of natural language processing (NLP)
techniques, have enabled the automatic processing of biomedical texts. In this
study, based on around 28,000 papers (titles and abstracts) provided by the
Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked
model performance on identifying DILI-related literature. Among four word
vectorization techniques, the model using term frequency-inverse document
frequency (TF-IDF) and logistic regression outperformed the others, with an
accuracy of 0.957 on our in-house test set. Furthermore, an ensemble model with
similar overall performance was implemented and fine-tuned to reduce
false negatives, so that potential DILI reports are not overlooked. The ensemble
model achieved a high accuracy of 0.954 and an F1 score of 0.955 on the
hold-out validation data provided by the CAMDA committee. Moreover, important
words in positive/negative predictions were identified via model
interpretation. Overall, the ensemble model reached satisfactory classification
results, which can be further used by researchers to rapidly filter
DILI-related literature.
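The TF-IDF plus logistic regression approach named in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and is not the authors' code: the eight toy sentences, the helper names (`fit_idf`, `tfidf`, `train_logreg`, `predict_proba`, `is_dili`), the smoothed IDF formula, and the plain stochastic gradient descent training loop. The abstract does not say how the ensemble was tuned to reduce false negatives; lowering the decision threshold, shown in `is_dili`, is one common way to trade extra false positives for fewer missed DILI papers.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]

def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; out-of-vocabulary words are dropped."""
    counts = Counter(tok for tok in tokenize(doc) if tok in idf)
    vec = {tok: n * idf[tok] for tok, n in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=300, lr=0.5):
    """Unregularised logistic regression fitted by stochastic gradient descent."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))
            err = p - y  # gradient of the log-loss with respect to the score
            bias -= lr * err
            for t, v in vec.items():
                weights[t] -= lr * err * v
    return weights, bias

def predict_proba(doc, idf, weights, bias):
    """Estimated probability that a title/abstract is DILI-related."""
    vec = tfidf(doc, idf)
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))

def is_dili(prob, threshold=0.5):
    """A lower threshold flags more papers, reducing false negatives."""
    return prob >= threshold

# Hypothetical toy corpus: 1 = DILI-related, 0 = unrelated.
train_docs = [
    "drug induced liver injury after acetaminophen overdose",
    "hepatotoxicity and liver failure linked to the drug",
    "elevated liver enzymes indicate drug hepatotoxicity",
    "acute liver injury reported among patients taking the drug",
    "deep learning for image segmentation of brain tumors",
    "genome sequencing reveals bacterial resistance genes",
    "exercise improves cardiovascular outcomes in adults",
    "protein folding simulation with molecular dynamics",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

idf = fit_idf(train_docs)
weights, bias = train_logreg([tfidf(d, idf) for d in train_docs], train_labels)

# Score an unseen title and apply a recall-oriented threshold.
prob = predict_proba("liver injury from drug", idf, weights, bias)
print(f"P(DILI) = {prob:.3f}, flagged = {is_dili(prob, threshold=0.3)}")

# Simple interpretation: words with the largest positive and negative weights.
ranked = sorted(weights, key=weights.get, reverse=True)
print("most DILI-like words:", ranked[:3])
print("least DILI-like words:", ranked[-3:])
```

Ranking words by learned weight is a rough analogue of the model interpretation step the abstract mentions: strongly positive weights push a paper toward the DILI class, and strongly negative weights push it away.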
Related papers
- CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE).
CRTRE applies randomized trial design principles to estimate the causal effect of association rules.
We then incorporate such association rules into downstream applications such as the prediction of disease onset.
arXiv Detail & Related papers (2024-11-10T02:40:06Z) - Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality [0.0]
This research aims to develop an interpretable and accurate ML model to help clinical professionals predict in-hospital mortality.
We analyzed ICU patient records from the MIMIC-III database based on specific criteria and extracted relevant data.
The Random Forest model was the most effective in predicting sepsis-related in-hospital mortality.
arXiv Detail & Related papers (2024-08-03T00:28:25Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - Detecting the Clinical Features of Difficult-to-Treat Depression using
Synthetic Data from Large Language Models [0.20971479389679337]
We seek to develop a Large Language Model (LLM)-based tool capable of interrogating routinely-collected, narrative (free-text) electronic health record data.
We use LLM-generated synthetic data (GPT3.5) and a Non-Maximum Suppression (NMS) algorithm to train a BERT-based span extraction model.
We show it is possible to obtain good overall performance (0.70 F1 across polarity) on real clinical data on a set of as many as 20 different factors, and high performance (0.85 F1 with 0.95 precision) on a subset of important DTD
arXiv Detail & Related papers (2024-02-12T13:34:33Z) - MedDistant19: A Challenging Benchmark for Distantly Supervised
Biomedical Relation Extraction [19.046156065686308]
Distant supervision is commonly used to tackle the scarcity of annotated data.
Bio-DSRE models can seemingly produce very accurate results in several benchmarks.
However, given the challenging nature of the task, we set out to investigate the validity of such impressive results.
arXiv Detail & Related papers (2022-04-10T22:07:25Z) - Assessment of contextualised representations in detecting outcome
phrases in clinical trials [14.584741378279316]
We introduce "EBM-COMET", a dataset in which 300 PubMed abstracts are expertly annotated for clinical outcomes.
To extract outcomes, we fine-tune a variety of pre-trained contextualized representations.
We observe our best model (BioBERT) achieve 81.5% F1, 81.3% sensitivity and 98.0% specificity.
arXiv Detail & Related papers (2022-02-13T15:08:00Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - HINT: Hierarchical Interaction Network for Trial Outcome Prediction
Leveraging Web Data [56.53715632642495]
Clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment.
In this paper, we propose Hierarchical INteraction Network (HINT) for more general, clinical trial outcome predictions.
arXiv Detail & Related papers (2021-02-08T15:09:07Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced
Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines, with gains of up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Understanding Clinical Trial Reports: Extracting Medical Entities and
Their Relations [33.30381080306156]
Medical experts must manually extract information from articles to inform decision-making.
We consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and (b) inferring the reported results for the former with respect to the latter.
arXiv Detail & Related papers (2020-10-07T17:50:58Z) - Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.