QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions
- URL: http://arxiv.org/abs/2411.17967v1
- Date: Wed, 27 Nov 2024 00:52:21 GMT
- Title: QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions
- Authors: Ramez Kouzy, Roxanna Attar-Olyaee, Michael K. Rooney, Comron J. Hassanzadeh, Junyi Jessy Li, Osama Mohamad,
- Abstract summary: We adapt the QuaLLM framework into QuaLLM-Health for extracting clinically relevant quantitative data from unstructured text.
We collected 410k posts and comments from five GLP-1-related communities using the Reddit API in July 2024.
Applying the framework to the full dataset enabled efficient extraction of variables necessary for downstream analysis.
- Score: 30.089810404792
- License:
- Abstract: Health-related discussions on social media platforms like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we adapt the QuaLLM framework into QuaLLM-Health for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs). We collected 410k posts and comments from five GLP-1-related communities using the Reddit API in July 2024. After filtering for cancer-related discussions, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain experts independently annotated a random sample of 100 entries to create a gold-standard dataset. We then employed iterative prompt engineering with OpenAI's "GPT-4o-mini" on the gold-standard dataset to build an optimized pipeline for extracting the variables from the large dataset. The optimized LLM achieved accuracies above 0.85 for all variables, with macro-averaged precision, recall, and F1 scores above 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of the variables needed for downstream analysis, costing under $3 and completing in approximately one hour. QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content. Incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, facilitating valuable insights for healthcare research.
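The abstract describes a prompt-based pipeline: structured extraction of a handful of binary variables with "GPT-4o-mini", validation against a 100-entry gold standard using macro-averaged precision/recall/F1, and stability testing across repeated runs. The snippet below is a minimal, hypothetical sketch of how such a pipeline could be wired up with the OpenAI Python SDK and scikit-learn; the variable names, prompt wording, and helper functions (`extract_entry`, `evaluate`, `stability_rate`) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a QuaLLM-Health-style extraction and evaluation step
# (not the authors' released code). Assumes the OpenAI Python SDK (>=1.x) and
# scikit-learn are installed, with OPENAI_API_KEY set in the environment.
import json

from openai import OpenAI
from sklearn.metrics import precision_recall_fscore_support

client = OpenAI()

# Binary variables paraphrased from the abstract; the exact schema is an assumption.
VARIABLES = [
    "cancer_survivor",
    "family_cancer_history",
    "mentions_cancer_type",
    "expresses_risk_perception",
    "discussed_with_physician",
]

SYSTEM_PROMPT = (
    "You extract clinically relevant variables from Reddit posts about GLP-1 "
    "receptor agonists. Answer strictly as JSON with boolean values for these "
    f"keys: {', '.join(VARIABLES)}."
)


def extract_entry(text: str) -> dict:
    """Ask GPT-4o-mini for a structured annotation of a single post or comment."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic settings help run-to-run stability
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)


def evaluate(gold: list[dict], predicted: list[dict]) -> dict:
    """Per-variable macro-averaged precision/recall/F1 against the gold standard."""
    scores = {}
    for var in VARIABLES:
        y_true = [int(g[var]) for g in gold]
        y_pred = [int(p[var]) for p in predicted]
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0
        )
        scores[var] = {"precision": p, "recall": r, "f1": f1}
    return scores


def stability_rate(text: str, runs: int = 5) -> float:
    """Fraction of repeated extractions that exactly match the first run."""
    outputs = [extract_entry(text) for _ in range(runs)]
    return sum(o == outputs[0] for o in outputs) / runs
```

Extracting with temperature 0 and a fixed JSON schema is one plausible way to obtain the run-to-run consistency the abstract reports; the abstract itself does not specify these settings.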
Related papers
- Leveraging large language models for structured information extraction from pathology reports [0.0]
We evaluate large language models' accuracy in extracting structured information from breast cancer histopathology reports.
An open-source tool for structured information extraction can be customized by non-programmers using natural language.
arXiv Detail & Related papers (2025-02-14T21:46:02Z) - Enhanced Electronic Health Records Text Summarization Using Large Language Models [0.0]
This project builds on prior work by creating a system that generates clinician-preferred, focused summaries.
The proposed system leverages the Flan-T5 model to generate tailored EHR summaries based on clinician-specified topics.
arXiv Detail & Related papers (2024-10-12T19:36:41Z) - Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models [19.72316842477808]
We evaluate whether modern large language models (LLMs) can reliably perform this task.
Massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis.
arXiv Detail & Related papers (2024-05-02T19:20:11Z) - Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z) - Multimodal LLMs for health grounded in individual-specific data [1.8473477867376036]
Foundation large language models (LLMs) have shown an impressive ability to solve tasks across a wide range of fields including health.
We take a step towards creating multimodal LLMs for health that are grounded in individual-specific data.
We show that HeLM can effectively use demographic and clinical features in addition to high-dimensional time-series data to estimate disease risk.
arXiv Detail & Related papers (2023-07-18T07:12:46Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs (see the sketch after this list).
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We demonstrate the few-shot cross-lingual transfer ability of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
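The self-verification entry above describes having the LLM check its own extractions and provide provenance for them. As a generic illustration of that pattern only (the prompts, JSON schema, and model choice below are assumptions, not taken from the cited paper), a two-pass call could look like this:

```python
# Generic two-pass self-verification sketch (an illustration of the idea only;
# prompts, model choice, and JSON schema are assumptions, not the cited paper's).
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_with_verification(note: str, fields: list[str]) -> dict:
    """Extract fields from a clinical note, then ask the model to verify them."""
    # Pass 1: extraction into JSON.
    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract these fields from the clinical note as JSON, "
                        f"using null when a field is not mentioned: {', '.join(fields)}."},
            {"role": "user", "content": note},
        ],
    )
    extracted = json.loads(extraction.choices[0].message.content)

    # Pass 2: self-verification -- for each extracted value, the model states
    # whether the note supports it and quotes the supporting span (provenance).
    verification = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "For each field in the proposed extraction, answer as JSON: "
                        '{"<field>": {"supported": true or false, "evidence": "<quoted span or null>"}}.'},
            {"role": "user",
             "content": f"Note:\n{note}\n\nProposed extraction:\n{json.dumps(extracted)}"},
        ],
    )
    return {"extraction": extracted,
            "verification": json.loads(verification.choices[0].message.content)}
```

The second pass returns, for each field, whether the note supports the extracted value together with a quoted evidence span, which downstream code can use to flag or drop unsupported extractions.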