Probing Pre-Trained Language Models for Disease Knowledge
- URL: http://arxiv.org/abs/2106.07285v1
- Date: Mon, 14 Jun 2021 10:31:25 GMT
- Title: Probing Pre-Trained Language Models for Disease Knowledge
- Authors: Israa Alghanmi, Luis Espinosa-Anke, Steven Schockaert
- Abstract summary: We introduce DisKnE, a new benchmark for Disease Knowledge Evaluation.
We define training-test splits per disease, ensuring that no knowledge about test diseases can be learned from the training data.
When analysing pre-trained models for the clinical/biomedical domain on the proposed benchmark, we find that their performance drops considerably.
- Score: 38.73378973397647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models such as ClinicalBERT have achieved impressive
results on tasks such as medical Natural Language Inference. At first glance,
this may suggest that these models are able to perform medical reasoning tasks,
such as mapping symptoms to diseases. However, we find that standard benchmarks
such as MedNLI contain relatively few examples that require such forms of
reasoning. To better understand the medical reasoning capabilities of existing
language models, in this paper we introduce DisKnE, a new benchmark for Disease
Knowledge Evaluation. To construct this benchmark, we annotated each positive
MedNLI example with the types of medical reasoning that are needed. We then
created negative examples by corrupting these positive examples in an
adversarial way. Furthermore, we define training-test splits per disease,
ensuring that no knowledge about test diseases can be learned from the training
data, and we canonicalize the formulation of the hypotheses to avoid the
presence of artefacts. This leads to a number of binary classification
problems, one for each type of reasoning and each disease. When analysing
pre-trained models for the clinical/biomedical domain on the proposed
benchmark, we find that their performance drops considerably.
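The per-disease split described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Example` fields, the `per_disease_splits` helper, and the rule of dropping training examples that mention the test disease by name are all assumptions made for illustration, and the actual DisKnE construction may filter differently.

```python
from collections import defaultdict
from typing import NamedTuple


class Example(NamedTuple):
    # All field names are hypothetical, chosen for this sketch only.
    premise: str
    hypothesis: str
    disease: str   # disease named in the hypothesis
    category: str  # type of reasoning, e.g. "symptoms"
    label: int     # 1 = positive, 0 = corrupted negative


def per_disease_splits(examples):
    """Yield ((category, disease), train, test): one binary task per pair."""
    by_category = defaultdict(list)
    for ex in examples:
        by_category[ex.category].append(ex)
    for category, items in sorted(by_category.items()):
        for disease in sorted({ex.disease for ex in items}):
            name = disease.lower()
            # Test set: every example whose hypothesis targets this disease.
            test = [ex for ex in items if ex.disease == disease]
            # Training set: other diseases only, and (as an assumed filter)
            # no mention of the test disease anywhere in the text.
            train = [
                ex for ex in items
                if ex.disease != disease
                and name not in ex.premise.lower()
                and name not in ex.hypothesis.lower()
            ]
            yield (category, disease), train, test


examples = [
    Example("Patient has polyuria and polydipsia.",
            "The patient has diabetes.", "diabetes", "symptoms", 1),
    Example("Patient has chest pain.",
            "The patient has diabetes.", "diabetes", "symptoms", 0),
    Example("Patient has productive cough and fever.",
            "The patient has pneumonia.", "pneumonia", "symptoms", 1),
    Example("History of diabetes, now with cough.",
            "The patient has pneumonia.", "pneumonia", "symptoms", 0),
]
splits = {key: (train, test) for key, train, test in per_disease_splits(examples)}
```

In this toy data, the pneumonia example whose premise mentions diabetes is excluded from the training set of the diabetes task, so a classifier trained on that split cannot pick up facts about the held-out disease from its training examples.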
Related papers
- SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge.
We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
- Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study [60.56194508762205]
We reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models.
We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation.
We present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models.
arXiv Detail & Related papers (2023-04-21T11:54:44Z)
- Semantic Coherence Markers for the Early Diagnosis of the Alzheimer Disease [0.0]
Perplexity was originally conceived as an information-theoretic measure to assess how much a given language model is suited to predict a text sequence.
We employed language models as diverse as N-grams, from 2-grams to 5-grams, and GPT-2, a transformer-based language model.
Best performing models achieved full accuracy and F-score (1.00 in both precision/specificity and recall/sensitivity) in categorizing subjects from both the AD class and control subjects.
arXiv Detail & Related papers (2023-02-02T11:40:16Z)
- This Patient Looks Like That Patient: Prototypical Networks for Interpretable Diagnosis Prediction from Clinical Text [56.32427751440426]
In clinical practice such models must not only be accurate, but provide doctors with interpretable and helpful results.
We introduce ProtoPatient, a novel method based on prototypical networks and label-wise attention.
We evaluate the model on two publicly available clinical datasets and show that it outperforms existing baselines.
arXiv Detail & Related papers (2022-10-16T10:12:07Z)
- Clinical Language Understanding Evaluation (CLUE) [17.254884920876695]
We present the Clinical Language Understanding Evaluation (CLUE) benchmark with a set of four clinical language understanding tasks, standard training, development, validation and testing sets derived from MIMIC data.
It is our hope that these data will enable direct comparison between approaches and reduce the barrier-to-entry for developing novel models or methods for these clinical language understanding tasks.
arXiv Detail & Related papers (2022-09-28T19:14:08Z)
- Stress Test Evaluation of Biomedical Word Embeddings [3.8376078864105425]
We systematically evaluate three language models with adversarial examples.
We show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases.
arXiv Detail & Related papers (2021-07-24T16:45:03Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
- On Adversarial Examples for Biomedical NLP Tasks [4.7677261488999205]
We propose an adversarial evaluation scheme on two well-known datasets for medical NER and STS.
We show that we can significantly improve the robustness of the models by training them with adversarial examples.
arXiv Detail & Related papers (2020-04-23T13:46:11Z)