From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
- URL: http://arxiv.org/abs/2509.23415v1
- Date: Sat, 27 Sep 2025 17:13:51 GMT
- Title: From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
- Authors: Gyubok Lee, Woosog Chay, Heeyoung Kwak, Yeong Hwa Kim, Haanju Yoo, Oksoon Jeong, Meong Hi Son, Edward Choi
- Abstract summary: We introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents. We show that while agents achieve a high Pass@5 of 90-95% (at least one success in five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (success in all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain.
- Score: 15.31222936637621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve a high Pass@5 of 90-95% (at least one success in five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
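To make the gap between the two metrics concrete, here is a minimal sketch (an illustration under assumed names and data layout, not the paper's code) of how Pass@5 and Pass^5 can be computed from per-task trial outcomes:
```python
# Minimal sketch (not the paper's code) of the two metrics reported above.
# `results` maps each task to its five i.i.d. trial outcomes; the function
# name and data layout are illustrative assumptions.

def pass_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (Pass@5, Pass^5) as fractions of tasks.

    Pass@5: at least one of the five trials succeeds.
    Pass^5: all five trials succeed (consistent success).
    """
    n = len(results)
    pass_at_5 = sum(any(trials) for trials in results.values()) / n
    pass_hat_5 = sum(all(trials) for trials in results.values()) / n
    return pass_at_5, pass_hat_5

# A task solved in 3/5 trials counts toward Pass@5 but not Pass^5.
demo = {
    "q1": [True, False, True, True, False],
    "q2": [True, True, True, True, True],
}
print(pass_metrics(demo))  # (1.0, 0.5)
```
Because Pass^5 requires every trial to succeed, even occasional flaky failures pull it far below Pass@5, which is exactly the robustness gap the abstract highlights.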
Related papers
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z) - Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL [63.578576078216976]
CLIN is a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement. Despite recent advances, performance remains far from clinical reliability.
arXiv Detail & Related papers (2026-01-14T21:12:06Z) - SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering [18.161591137171623]
We introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries.
arXiv Detail & Related papers (2025-11-13T06:35:29Z) - BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions [33.59162905707337]
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings.
arXiv Detail & Related papers (2025-10-06T19:31:47Z) - OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation [1.3722079106827219]
OptAgent is a novel framework that combines multi-agent simulations with genetic algorithms to verify and optimize query rewrites for e-commerce. We evaluate OptAgent on a dataset of 1000 real-world e-commerce queries in five different categories.
arXiv Detail & Related papers (2025-10-04T10:41:09Z) - Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments [2.8724171056550256]
Compliance Brain Assistant (CBA) is a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between FastTrack mode and FullAgentic mode.
arXiv Detail & Related papers (2025-07-23T07:51:10Z) - RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on fewer than 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification [5.666070277424383]
MAG-V is a framework to generate a dataset of questions that mimic customer queries. Our synthetic data can improve agent performance on actual customer queries.
arXiv Detail & Related papers (2024-11-28T19:36:11Z) - Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, situational queries require the agent to correctly identify multiple object-states and reach a consensus on their states for an answer. We introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information.
arXiv Detail & Related papers (2024-05-08T00:45:20Z) - InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [84.7788065721689]
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks.
This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files.
Building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench.
arXiv Detail & Related papers (2024-01-10T19:04:00Z)