Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources
- URL: http://arxiv.org/abs/2406.18049v1
- Date: Wed, 26 Jun 2024 03:56:21 GMT
- Title: Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources
- Authors: Yiming Li, Deepthi Viswaroopan, William He, Jianfu Li, Xu Zuo, Hua Xu, Cui Tao
- Abstract summary: Adverse event (AE) extraction is crucial for monitoring and analyzing the safety profiles of immunizations.
This study aims to evaluate the effectiveness of large language models (LLMs) and traditional deep learning models in AE extraction.
- Score: 13.750202656564907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting adverse events (AEs) following COVID-19 vaccination from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel at understanding contextual information but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study evaluates the effectiveness of LLMs and traditional deep learning models in AE extraction and assesses the impact of ensembling these models on performance. Our corpora comprised reports and posts from VAERS (n=621), Twitter (n=9,133), and Reddit (n=131), from which we extracted three entity types: "vaccine", "shot", and "ae". We explored multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, fine-tuning all but GPT-4, alongside traditional deep learning models such as RNN and BioBERT. To enhance performance, we created an ensemble of the three best-performing models. For evaluation, we used strict and relaxed F1 scores for each entity type and micro-averaged F1 for overall performance. The ensemble achieved the highest performance on "vaccine", "shot", and "ae", with strict F1 scores of 0.878, 0.930, and 0.925, respectively, and a micro-averaged F1 of 0.903. This study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs for extracting AE-related information, and contributes to biomedical natural language processing by providing insights into improving AE extraction from text for pharmacovigilance and public health surveillance.
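As a rough illustration of the pipeline the abstract describes, the sketch below combines token-level BIO predictions from three models by majority vote and scores the result with a strict (exact-boundary, exact-type) F1. The voting rule, BIO tag scheme, and toy predictions are assumptions for illustration; the abstract does not spell out the exact ensembling procedure.

```python
# Hedged sketch: majority-vote ensembling of NER tags plus strict F1.
# The BIO scheme and voting rule are illustrative assumptions.
from collections import Counter

def majority_vote(tag_sequences):
    """Combine per-token BIO tags from several models by majority vote."""
    ensembled = []
    for token_tags in zip(*tag_sequences):
        tag, _ = Counter(token_tags).most_common(1)[0]  # ties break arbitrarily
        ensembled.append(tag)
    return ensembled

def to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes any open span
        if start is not None and (tag == "O" or tag.startswith("B-")):
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(spans)

def strict_f1(gold_tags, pred_tags):
    """Strict F1: a predicted span counts only if boundaries and type match."""
    gold, pred = to_spans(gold_tags), to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy tags for a six-token post such as "got my covid shot , sore-arm"
rnn  = ["O", "O", "B-vaccine", "B-shot", "O", "B-ae"]
bert = ["O", "O", "B-vaccine", "B-shot", "O", "O"]
llm  = ["O", "O", "O",         "B-shot", "O", "B-ae"]
gold = ["O", "O", "B-vaccine", "B-shot", "O", "B-ae"]

ensemble = majority_vote([rnn, bert, llm])
print(strict_f1(gold, ensemble))  # 1.0: the vote recovers every gold span
```

A relaxed F1 would differ only in `strict_f1`, counting a prediction as correct when it overlaps a gold span of the same type rather than matching it exactly.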
Related papers
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z) - Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions [0.0]
Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks.
We measured overall and out-of-distribution (OOD) performance for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN).
Our synthetic data yielded a substantial improvement for DEN: in all three training corpora, the top-1 accuracy of both SapBERT and KrissBERT improved by 3-9 points overall and by 20-55 points on OOD data.
arXiv Detail & Related papers (2024-10-10T14:18:34Z) - Enhancing SLM via ChatGPT and Dataset Augmentation [0.3844771221441211]
We employ knowledge distillation-based techniques and synthetic dataset augmentation to bridge the performance gap between large language models (LLMs) and small language models (SLMs).
Our methods involve two forms of rationale generation, information extraction and informed reasoning, to enrich the ANLI dataset.
Our findings reveal that the incorporation of synthetic rationales significantly improves the model's ability to comprehend natural language, leading to 1.3% and 2.3% higher classification accuracy, respectively, on the ANLI dataset.
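A minimal sketch of the rationale-augmentation idea above: each ANLI-style example is paired with a teacher-generated rationale before being used to train the SLM. `AugmentedExample` and `generate_rationale` are hypothetical stand-ins for the actual ChatGPT-based pipeline.

```python
# Hedged sketch of rationale-based dataset augmentation for distillation.
# generate_rationale is a hypothetical placeholder for a teacher-LLM call.
from dataclasses import dataclass

@dataclass
class AugmentedExample:
    premise: str
    hypothesis: str
    label: str      # entailment / neutral / contradiction
    rationale: str  # teacher-provided explanation

def generate_rationale(premise, hypothesis, label, mode="informed_reasoning"):
    # Placeholder; a real pipeline would query the teacher LLM here.
    return f"[{mode}] '{premise}' supports '{hypothesis}', so the label is {label}."

def augment(dataset, mode):
    return [AugmentedExample(p, h, y, generate_rationale(p, h, y, mode))
            for p, h, y in dataset]

anli_like = [("A man plays guitar on stage.",
              "A person is performing music.", "entailment")]
print(augment(anli_like, "information_extraction")[0].rationale)
```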
arXiv Detail & Related papers (2024-09-19T09:24:36Z) - Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
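A framework-free sketch of that learn-focus-review loop: estimate loss per data block, periodically refresh the estimates (review), and sample high-loss blocks more often (focus). The loss stub, refresh interval, and loss-proportional sampling weights are illustrative assumptions, not the paper's exact pacing rules.

```python
import random

def eval_loss(block):
    """Stub standing in for a real forward pass returning mean token loss."""
    return random.uniform(0.5, 3.0)

def lfr_schedule(blocks, steps, review_every=10):
    # Learn: take an initial loss estimate for every block.
    losses = {i: eval_loss(b) for i, b in enumerate(blocks)}
    for step in range(1, steps + 1):
        if step % review_every == 0:
            # Review: refresh stale loss estimates on a fixed cadence.
            losses = {i: eval_loss(b) for i, b in enumerate(blocks)}
        # Focus: sample blocks with probability proportional to current loss.
        total = sum(losses.values())
        weights = [losses[i] / total for i in range(len(blocks))]
        chosen = random.choices(range(len(blocks)), weights=weights, k=1)[0]
        yield blocks[chosen]  # train on the chosen (likely challenging) block

for batch in lfr_schedule([f"block-{i}" for i in range(8)], steps=5):
    print("training on", batch)
```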
arXiv Detail & Related papers (2024-09-10T00:59:18Z) - On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction [2.874893537471256]
This study evaluates the performance of classical tree-based models and advanced neural networks in protein-ligand binding affinity prediction.
We show that combining 2D and 3D model strengths improves active learning outcomes beyond current state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-15T13:06:00Z) - Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z) - Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events [17.73671383380315]
We study how large language models (LLMs) can be used to scale biomedical knowledge curation.
We find that substantial gains can be attained over out-of-the-box LLMs, with additional advantages in cost, efficiency, and white-box model access.
arXiv Detail & Related papers (2023-07-12T20:08:48Z) - Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
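As a rough sketch of that augmentation step, the snippet below conditions GPT-2 generation on the outcome label so synthetic notes remain labeled. The label-prefix prompt format and the base `gpt2` checkpoint are assumptions standing in for the authors' fine-tuned model.

```python
# Hedged sketch: label-conditioned synthetic note generation with GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # swap in the fine-tuned checkpoint

def synthesize_note(label, max_new_tokens=80):
    # Conditioning on the label keeps each synthetic note paired with an outcome.
    prompt = f"<readmitted={label}> Discharge summary:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(synthesize_note(1))  # one synthetic "readmitted within 30 days" note
```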
arXiv Detail & Related papers (2022-11-13T01:07:23Z) - Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given in-distribution accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z) - Negation detection in Dutch clinical texts: an evaluation of rule-based and machine learning methods [0.21079694661943607]
We compare three methods for negation detection in Dutch clinical notes.
We found that both the biLSTM and RoBERTa models consistently outperformed the rule-based model in F1 score, precision, and recall.
arXiv Detail & Related papers (2022-09-01T14:00:13Z) - An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.