Stress Test Evaluation of Biomedical Word Embeddings
- URL: http://arxiv.org/abs/2107.11652v1
- Date: Sat, 24 Jul 2021 16:45:03 GMT
- Title: Stress Test Evaluation of Biomedical Word Embeddings
- Authors: Vladimir Araujo, Andrés Carvallo, Carlos Aspillaga, Camilo Thorne, Denis Parra
- Abstract summary: We systematically evaluate three language models with adversarial examples.
We show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases.
- Score: 3.8376078864105425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of pretrained word embeddings has motivated their use in the
biomedical domain, with contextualized embeddings yielding remarkable results
in several biomedical NLP tasks. However, there is a lack of research on
quantifying their behavior under severe "stress" scenarios. In this work, we
systematically evaluate three language models with adversarial examples --
automatically constructed tests that allow us to examine how robust the models
are. We propose two types of stress scenarios focused on the biomedical named
entity recognition (NER) task, one inspired by spelling errors and another
based on the use of synonyms for medical terms. Our experiments with three
benchmarks show that the performance of the original models decreases
considerably, in addition to revealing their weaknesses and strengths. Finally,
we show that adversarial training causes the models to improve their robustness
and even to exceed the original performance in some cases.
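The two stress scenarios described in the abstract, spelling errors and medical-term synonym substitution, can be sketched as simple text perturbations. The snippet below is a minimal illustration of the idea, not the paper's exact procedure; the synonym lexicon and the character-swap rule are hypothetical examples chosen for demonstration.

```python
import random

# Hypothetical mini-lexicon of medical terms and lay synonyms,
# for illustration only.
MEDICAL_SYNONYMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
}

def add_typo(word, rng):
    """Swap two adjacent characters to simulate a spelling error."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb_spelling(sentence, rng=None):
    """Introduce one typo in each sufficiently long token."""
    rng = rng or random.Random(0)
    return " ".join(add_typo(w, rng) for w in sentence.split())

def perturb_synonyms(sentence):
    """Replace known medical terms with lay synonyms."""
    for term, synonym in MEDICAL_SYNONYMS.items():
        sentence = sentence.replace(term, synonym)
    return sentence
```

An adversarial NER test set would apply such perturbations to benchmark sentences while keeping the gold entity labels aligned, then measure how much the model's F1 degrades.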
Related papers
- DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness [27.14794371879541]
This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials.
By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement, we introduce greater diversity and reduce shortcut learning.
Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark.
arXiv Detail & Related papers (2024-04-14T10:02:47Z) - Context-aware Adversarial Attack on Named Entity Recognition [15.049160192547909]
We study context-aware adversarial attack methods to examine the model's robustness.
Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples.
Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
arXiv Detail & Related papers (2023-09-16T14:04:23Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models, given its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z) - Assessment of contextualised representations in detecting outcome phrases in clinical trials [14.584741378279316]
We introduce "EBM-COMET", a dataset in which 300 PubMed abstracts are expertly annotated for clinical outcomes.
To extract outcomes, we fine-tune a variety of pre-trained contextualized representations.
We observe that our best model (BioBERT) achieves 81.5% F1, 81.3% sensitivity, and 98.0% specificity.
arXiv Detail & Related papers (2022-02-13T15:08:00Z) - Self-training with Few-shot Rationalization: Teacher Explanations Aid Student in Few-shot NLU [88.8401599172922]
We develop a framework based on self-training language models with limited task-specific labels and rationales.
We show that the neural model performance can be significantly improved by making it aware of its rationalized predictions.
arXiv Detail & Related papers (2021-09-17T00:36:46Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation [51.20569527047729]
Mistranslated numbers can cause serious consequences, such as financial loss or medical misinformation.
We develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing.
arXiv Detail & Related papers (2021-07-18T04:09:47Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks, including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Probing Pre-Trained Language Models for Disease Knowledge [38.73378973397647]
We introduce DisKnE, a new benchmark for Disease Knowledge Evaluation.
We define training-test splits per disease, ensuring that no knowledge about test diseases can be learned from the training data.
When analysing pre-trained models for the clinical/biomedical domain on the proposed benchmark, we find that their performance drops considerably.
arXiv Detail & Related papers (2021-06-14T10:31:25Z) - On Adversarial Examples for Biomedical NLP Tasks [4.7677261488999205]
We propose an adversarial evaluation scheme on two well-known datasets for medical NER and STS.
We show that we can significantly improve the robustness of the models by training them with adversarial examples.
arXiv Detail & Related papers (2020-04-23T13:46:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.