EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases
- URL: http://arxiv.org/abs/2510.00549v2
- Date: Thu, 02 Oct 2025 03:21:52 GMT
- Title: EMR-AGENT: Automating Cohort and Feature Extraction from EMR Databases
- Authors: Kwanhyung Lee, Sungsoo Hong, Joonhyung Park, Jeonghyeop Lim, Juhwan Choi, Donghwee Yoon, Eunho Yang
- Abstract summary: EMR-AGENT is an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Results demonstrate strong performance and generalization across three EMR databases.
- Score: 41.15581072407935
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine learning models for clinical prediction rely on structured data extracted from Electronic Medical Records (EMRs), yet this process remains dominated by hardcoded, database-specific pipelines for cohort definition, feature selection, and code mapping. These manual efforts limit scalability, reproducibility, and cross-institutional generalization. To address this, we introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction to extract and standardize structured clinical data. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Our modular agents iteratively observe query results and reason over schema and documentation, using SQL not just for data retrieval but also as a tool for database observation and decision making. This eliminates the need for hand-crafted, schema-specific logic. To enable rigorous evaluation, we develop a benchmarking codebase for three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen schema settings. Our results demonstrate strong performance and generalization across these databases, highlighting the feasibility of automating a process previously thought to require expert-driven design. The code will be released publicly at https://github.com/AITRICS/EMR-AGENT/tree/main. For a demonstration, please visit our anonymous demo page: https://anonymoususer-max600.github.io/EMR_AGENT/
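The observe-then-query idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy schema, the table and column names, and the helper functions (`observe_schema`, `observe_labels`, `map_feature`, `extract`) are all hypothetical, and a simple synonym lookup stands in for the language-model reasoning step. It only shows how SQL can first be used to inspect a database and then to retrieve a cohort and a feature.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical EMR-like schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
        CREATE TABLE chart_events (patient_id INTEGER, item_label TEXT, value REAL);
        INSERT INTO patients VALUES (1, 70), (2, 15), (3, 45);
        INSERT INTO chart_events VALUES
            (1, 'Heart Rate', 88.0), (1, 'SpO2', 97.0),
            (2, 'Heart Rate', 110.0), (3, 'HR', 72.0);
    """)
    return conn


def observe_schema(conn):
    """Observation step: list tables and their columns using SQL only."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return schema


def observe_labels(conn):
    """Observation step: sample the distinct codes actually stored."""
    query = "SELECT DISTINCT item_label FROM chart_events ORDER BY item_label"
    return [row[0] for row in conn.execute(query)]


def map_feature(labels, synonyms):
    """Code-mapping step. A synonym lookup stands in for LLM reasoning (assumption)."""
    wanted = {s.lower() for s in synonyms}
    return [label for label in labels if label.lower() in wanted]


def extract(conn, min_age, labels):
    """Retrieval step: cohort filter (age) plus feature filter (mapped labels)."""
    placeholders = ", ".join("?" for _ in labels)
    sql = f"""
        SELECT p.patient_id, c.value
        FROM patients p JOIN chart_events c ON c.patient_id = p.patient_id
        WHERE p.age >= ? AND c.item_label IN ({placeholders})
        ORDER BY p.patient_id
    """
    return conn.execute(sql, [min_age, *labels]).fetchall()


conn = build_toy_db()
schema = observe_schema(conn)      # what tables and columns exist?
labels = observe_labels(conn)      # what codes are stored?
hr_labels = map_feature(labels, ["heart rate", "hr"])
rows = extract(conn, 18, hr_labels)
print(schema)
print(hr_labels)
print(rows)
```

The point of the sketch is the ordering: nothing about the schema or the stored labels is hardcoded into the extraction query, so the same three steps could in principle be pointed at a database with different table names, which is the kind of generalization the abstract targets.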
Related papers
- IESR: Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models [10.758655501692793]
We propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models.
We show that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat and the Archer dataset.
Our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning.
arXiv Detail & Related papers (2026-02-05T07:10:45Z) - TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents [64.11547566154947]
We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database.
We introduce a new benchmark featuring common demands such as data infilling, row population, and column addition.
Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models.
arXiv Detail & Related papers (2025-10-28T02:49:40Z) - QueryGym: Step-by-Step Interaction with Relational Databases [30.757678338337055]
We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents.
Existing frameworks often tie agents to specific query language dialects or obscure their reasoning.
QueryGym requires agents to construct explicit sequences of relational algebra operations.
arXiv Detail & Related papers (2025-09-25T22:48:49Z) - Agentic AI framework for End-to-End Medical Data Inference [5.871161259593687]
We introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference.
We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging.
arXiv Detail & Related papers (2025-07-24T05:56:25Z) - THOR: Transformer Heuristics for On-Demand Retrieval [10.667949307405983]
We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, designed and implemented by eSapiens.
The THOR Module empowers non-language users to access live data with zero-language simplicity and enterprise-grade safety.
arXiv Detail & Related papers (2025-07-13T11:48:24Z) - Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases [50.552056536968166]
We propose and evaluate an algorithm for automating column matching on two large, popular, and publicly accessible EHR databases.
Our approach achieves a high top-three accuracy of 92%, correctly matching 12 out of the 13 columns of interest, when using a small, pre-trained general-purpose language model.
arXiv Detail & Related papers (2024-12-16T06:19:35Z) - UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding.
Our framework consists of three modules: a named entity recognizer (NER), a neural entity linker (NEL), and a neural semantic parser (NSP).
arXiv Detail & Related papers (2022-09-28T21:00:30Z) - Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making [22.755892575582788]
Entity Matching aims at recognizing entity records that denote the same real-world object.
We propose a novel EM framework that consists of Heterogeneous Information Fusion (HIF) and Key Attribute Tree (KAT) Induction.
Our method is highly efficient and outperforms SOTA EM models in most cases.
arXiv Detail & Related papers (2021-06-08T08:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.