Leveraging LLMs for Structured Data Extraction from Unstructured Patient Records
- URL: http://arxiv.org/abs/2512.13700v1
- Date: Wed, 03 Dec 2025 14:10:12 GMT
- Title: Leveraging LLMs for Structured Data Extraction from Unstructured Patient Records
- Authors: Mitchell A. Klusty, Elizabeth C. Solie, Caroline N. Leach, W. Vaiden Logan, Lynnet E. Richey, John C. Gensel, David P. Szczykutowicz, Bryan C. McLellan, Emily B. Collier, Samuel E. Armstrong, V. K. Cody Bumgardner
- Abstract summary: Manual chart review remains an extremely time-consuming and resource-intensive component of clinical research. We present a framework for automated structured feature extraction from clinical notes leveraging locally deployed large language models (LLMs). This framework demonstrates the potential of LLM systems to reduce the burden of manual chart review and increase consistency in data capture.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manual chart review remains an extremely time-consuming and resource-intensive component of clinical research, requiring experts to extract often complex information from unstructured electronic health record (EHR) narratives. We present a secure, modular framework for automated structured feature extraction from clinical notes leveraging locally deployed large language models (LLMs) on institutionally approved, Health Insurance Portability and Accountability Act (HIPAA)-compliant compute infrastructure. This system integrates retrieval-augmented generation (RAG) and structured response methods of LLMs into a widely deployable and scalable container to provide feature extraction for diverse clinical domains. In evaluation, the framework achieved high accuracy across multiple medical characteristics present in large bodies of patient notes when compared against an expert-annotated dataset, and identified several annotation errors missed in manual review. This framework demonstrates the potential of LLM systems to reduce the burden of manual chart review through automated extraction and increase consistency in data capture, accelerating clinical research.
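The abstract describes combining RAG with structured (JSON-constrained) LLM responses to extract features from notes. A minimal sketch of the prompt-assembly and response-validation stages might look like the following; the feature names, schema, and helper functions are illustrative assumptions, not the paper's actual interface, and no real model is called here:

```python
import json

# Hypothetical clinical feature schema; field names are illustrative,
# not taken from the paper.
FEATURE_SCHEMA = {
    "injury_date": "string (YYYY-MM-DD) or null",
    "injury_severity": "one of ['mild', 'moderate', 'severe'] or null",
    "surgical_intervention": "boolean or null",
}

def build_extraction_prompt(retrieved_chunks, schema=FEATURE_SCHEMA):
    """Assemble a structured-response prompt from RAG-retrieved note chunks."""
    context = "\n---\n".join(retrieved_chunks)
    fields = "\n".join(f'  "{k}": {v}' for k, v in schema.items())
    return (
        "Extract the following features from the clinical notes below.\n"
        "Respond with a single JSON object matching this schema:\n"
        "{\n" + fields + "\n}\n\n"
        "Notes:\n" + context
    )

def parse_structured_response(raw, schema=FEATURE_SCHEMA):
    """Parse the model's JSON reply, keeping only the schema's fields."""
    data = json.loads(raw)
    return {k: data.get(k) for k in schema}

# Mock round trip: a hand-written reply stands in for the local LLM.
prompt = build_extraction_prompt(
    ["Pt underwent surgical decompression on 2024-03-02."]
)
mock_reply = (
    '{"injury_date": "2024-03-02", "injury_severity": null,'
    ' "surgical_intervention": true, "unrequested_field": 1}'
)
features = parse_structured_response(mock_reply)
```

Filtering the reply down to the schema's keys is one simple way to keep downstream tables consistent even when the model emits extra fields; a deployed system would likely enforce the schema at decode time via constrained generation.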
Related papers
- Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science [3.4325249294405555]
This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks. We test the ability of LLMs to interact accurately with large structured datasets for analytics. We present a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task.
arXiv Detail & Related papers (2026-01-28T14:57:36Z) - CNSight: Evaluation of Clinical Note Segmentation Tools [3.673249612734457]
We evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and free-text segmentation.
arXiv Detail & Related papers (2025-12-28T05:40:15Z) - Additive Large Language Models for Semi-Structured Text [3.073796943975155]
CALM is an interpretable framework for semi-structured text. It predicts outcomes as the additive sum of each component's contribution, and achieves performance comparable to conventional Large Language Models.
arXiv Detail & Related papers (2025-11-14T23:06:16Z) - Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation [61.350584471060756]
Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images. We propose Self-Supervised Anatomical Consistency Learning (SS-ACL) to align generated reports with corresponding anatomical regions. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy.
arXiv Detail & Related papers (2025-09-30T08:59:06Z) - From EMR Data to Clinical Insight: An LLM-Driven Framework for Automated Pre-Consultation Questionnaire Generation [9.269061009613033]
We propose a novel framework for generating pre-consultation questionnaires from complex Electronic Medical Records (EMRs). This framework overcomes limitations of direct methods by building explicit clinical knowledge. Evaluated on a real-world EMR dataset and validated by clinical experts, our method demonstrates superior performance in information coverage, diagnostic relevance, understandability, and generation time.
arXiv Detail & Related papers (2025-08-01T12:24:49Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Automated Semantic Analysis with LLM and RAG for Pharmaceutical Package Inserts [0.0]
This work investigates the use of RAG (Retrieval-Augmented Generation) architectures combined with Large Language Models (LLMs) to automate the analysis of documents in PDF format. The proposal integrates vector search techniques via embeddings, semantic data extraction, and generation of contextualized natural language responses.
arXiv Detail & Related papers (2025-07-07T17:48:15Z) - GENIE: Generative Note Information Extraction model for structuring EHR data [14.057531175321113]
We introduce GENIE, a Generative Note Information Extraction system. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Using a robust data preparation pipeline and fine-tuned small-scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks.
arXiv Detail & Related papers (2025-01-30T15:42:24Z) - Enhancing In-Hospital Mortality Prediction Using Multi-Representational Learning with LLM-Generated Expert Summaries [3.5508427067904864]
In-hospital mortality (IHM) prediction for ICU patients is critical for timely interventions and efficient resource allocation.
This study integrates structured physiological data and clinical notes with Large Language Model (LLM)-generated expert summaries to improve IHM prediction accuracy.
arXiv Detail & Related papers (2024-11-25T16:36:38Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.