From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
- URL: http://arxiv.org/abs/2509.23415v1
- Date: Sat, 27 Sep 2025 17:13:51 GMT
- Title: From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
- Authors: Gyubok Lee, Woosog Chay, Heeyoung Kwak, Yeong Hwa Kim, Haanju Yoo, Oksoon Jeong, Meong Hi Son, Edward Choi
- Abstract summary: We introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents. We show that while agents achieve a high Pass@5 of 90-95% (at least one success in five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (success in all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain.
- Score: 15.31222936637621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve a high Pass@5 of 90-95% (at least one success in five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
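To make the gap between the two metrics concrete, here is a minimal sketch (an illustration under assumed names and data layout, not the paper's code) of how Pass@5 and Pass^5 can be computed from per-task trial outcomes:
```python
# Minimal sketch (not the paper's code) of the two metrics reported above.
# `results` maps each task to its five i.i.d. trial outcomes; the function
# name and data layout are illustrative assumptions.

def pass_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (Pass@5, Pass^5) as fractions of tasks.

    Pass@5: at least one of the five trials succeeds.
    Pass^5: all five trials succeed (consistent success).
    """
    n = len(results)
    pass_at_5 = sum(any(trials) for trials in results.values()) / n
    pass_hat_5 = sum(all(trials) for trials in results.values()) / n
    return pass_at_5, pass_hat_5

# A task solved in 3/5 trials counts toward Pass@5 but not Pass^5.
demo = {
    "q1": [True, False, True, True, False],
    "q2": [True, True, True, True, True],
}
print(pass_metrics(demo))  # (1.0, 0.5)
```
Because Pass^5 requires every trial to succeed, even occasional flaky failures pull it far below Pass@5, which is exactly the robustness gap the abstract highlights.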
Related papers
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z) - Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL [63.578576078216976]
CLIN is a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement. Despite recent advances, performance remains far from clinical reliability.
arXiv Detail & Related papers (2026-01-14T21:12:06Z) - SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering [18.161591137171623]
We introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries.
arXiv Detail & Related papers (2025-11-13T06:35:29Z) - BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions [33.59162905707337]
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings.
arXiv Detail & Related papers (2025-10-06T19:31:47Z) - OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation [1.3722079106827219]
OptAgent is a novel framework that combines multi-agent simulations with genetic algorithms to verify and optimize query rewrites for e-commerce. We evaluate OptAgent on a dataset of 1000 real-world e-commerce queries in five different categories.
arXiv Detail & Related papers (2025-10-04T10:41:09Z) - Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments [2.8724171056550256]
Compliance Brain Assistant (CBA) is a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between FastTrack mode and FullAgentic mode.
arXiv Detail & Related papers (2025-07-23T07:51:10Z) - RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on fewer than 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification [5.666070277424383]
MAG-V is a framework to generate a dataset of questions that mimic customer queries. Our synthetic data can improve agent performance on actual customer queries.
arXiv Detail & Related papers (2024-11-28T19:36:11Z) - Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, situational queries require the agent to correctly identify multiple object-states and reach a consensus on their states for an answer. We introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information.
arXiv Detail & Related papers (2024-05-08T00:45:20Z) - InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [84.7788065721689]
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks.
This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files.
Building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench.
arXiv Detail & Related papers (2024-01-10T19:04:00Z)