Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
- URL: http://arxiv.org/abs/2502.00943v1
- Date: Sun, 02 Feb 2025 22:31:54 GMT
- Title: Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
- Authors: Cliff Wong, Sam Preston, Qianchu Liu, Zelalem Gero, Jass Bagga, Sheng Zhang, Shrey Jain, Theodore Zhao, Yu Gu, Yanbo Xu, Sid Kiblawi, Roshanthi Weerasinghe, Rom Leidner, Kristina Young, Brian Piening, Carlo Bifulco, Tristan Naumann, Mu Wei, Hoifung Poon,
- Abstract summary: We propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs)
UMA can quickly scale to new attributes through its universal prompt template without curating attribute-specific training labels or rules.
We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey.
- Score: 16.483707546185595
- License:
- Abstract: The vast majority of real-world patient information resides in unstructured clinical text, and the process of medical abstraction seeks to extract and normalize structured information from this unstructured input. However, traditional medical abstraction methods can require significant manual efforts that can include crafting rules or annotating training labels, limiting scalability. In this paper, we propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs) through a modular and customizable prompt template. We refer to our approach as universal abstraction as it can quickly scale to new attributes through its universal prompt template without curating attribute-specific training labels or rules. We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey, from short-context attributes (e.g., performance status, treatment) to complex long-context attributes requiring longitudinal reasoning (e.g., tumor site, histology, TNM staging). Experiments on real-world data show UMA's strong performance and generalizability. Compared to supervised and heuristic baselines, UMA with GPT-4o achieves on average an absolute 2-point F1/accuracy improvement for both short-context and long-context attribute abstraction. For pathologic T staging, UMA even outperforms the supervised model by 20 points in accuracy.
Related papers
- GENIE: Generative Note Information Extraction model for structuring EHR data [14.057531175321113]
We introduce GENIE, a Generative Note Information Extraction system.
GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifier, values, and purposes with high accuracy.
Using a robust data preparation pipeline and fine-tuned small scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks.
arXiv Detail & Related papers (2025-01-30T15:42:24Z) - Representation Learning of Structured Data for Medical Foundation Models [29.10129199884847]
We introduce the UniStruct architecture to design a multimodal medical foundation model of unstructured text and structured data.
Our approach is validated through model pre-training on both an extensive internal medical database and a public repository of structured medical records.
arXiv Detail & Related papers (2024-10-17T09:02:28Z) - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z) - GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models [1.123722364748134]
This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs)
The methodology integrates open-source LLMs for NER, utilizing chained prompts and Pydantic schemas for structured output to navigate the complexities of specialized medical jargon.
The findings reveal significant ROUGE F1 score on one of the evaluation datasets with an accuracy of 98%.
arXiv Detail & Related papers (2024-05-31T02:53:22Z) - Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography [50.08496922659307]
We propose a universal framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes.
Firstly, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models.
Secondly, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors.
arXiv Detail & Related papers (2024-05-28T16:55:15Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Local Large Language Models for Complex Structured Medical Tasks [0.0]
This paper introduces an approach that combines the language reasoning capabilities of large language models with the benefits of local training to tackle complex, domain-specific tasks.
Specifically, the authors demonstrate their approach by extracting structured condition codes from pathology reports.
arXiv Detail & Related papers (2023-08-03T12:36:13Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - Knowledge-Rich Self-Supervised Entity Linking [58.838404666183656]
Knowledge-RIch Self-Supervision ($tt KRISSBERT$) is a universal entity linker for four million UMLS entities.
Our approach subsumes zero-shot and few-shot methods, and can easily incorporate entity descriptions and gold mention labels if available.
Without using any labeled information, our method produces $tt KRISSBERT$, a universal entity linker for four million UMLS entities.
arXiv Detail & Related papers (2021-12-15T05:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.