Fugu-MT paper translation (abstract): From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

Paper abstract: From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

arxiv url: http://arxiv.org/abs/2509.23415v1
Date: Sat, 27 Sep 2025 17:13:51 GMT
Status: translation complete
Last updated in system: 2025-09-30 22:32:19.215819
Title: From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
Title (reference translation): From Conversation to Query Execution: A Benchmark of User and Tool Interactions for EHR Database Agents
Authors: Gyubok Lee, Woosog Chay, Heeyoung Kwak, Yeong Hwa Kim, Haanju Yoo, Oksoon Jeong, Meong Hi Son, Edward Choi
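Abstract summary: EHR-ChatQA is an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents. Agents achieve a high Pass@5 (success in at least one of five trials) of 90-95% on IncreQA and 60-80% on AdaptQA, but their Pass^5 (success in all five trials) is lower by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain.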
Reference score (independently computed attention score): 15.31222936637621
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve high Pass@5 of 90-95% (at least one of five trials) on IncreQA and 60-80% on AdaptQA, their Pass^5 (consistent success across all five trials) is substantially lower by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
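Abstract (reference translation): Despite the strong performance of LLM-based agents, their adoption for Electronic Health Record (EHR) data access remains limited because no benchmark adequately captures real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity arising from vague user questions, and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA evaluates agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), in which users add constraints to an existing query, and Adaptive Query Refinement (AdaptQA), in which users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that agents reach a high Pass@5 (success in at least one of five trials) of 90-95% on IncreQA and 60-80% on AdaptQA, while their Pass^5 (success in all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.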
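
The gap between Pass@5 and Pass^5 described in the abstract can be illustrated with a minimal Python sketch. It assumes each task is run exactly five times and that every trial is recorded as a boolean. The function names and the sample outcomes are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in at least one of their trials."""
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in every one of their trials."""
    return sum(all(outcomes) for outcomes in trials.values()) / len(trials)


# Hypothetical outcomes: 4 tasks, 5 trials each (made-up data, not from the paper).
results = {
    "task_a": [True, True, True, True, True],       # always solved
    "task_b": [True, False, True, True, False],     # solved inconsistently
    "task_c": [False, False, True, False, False],   # solved once
    "task_d": [False, False, False, False, False],  # never solved
}

print(pass_at_k(results))   # 0.75: tasks a, b and c succeed at least once
print(pass_hat_k(results))  # 0.25: only task a succeeds in all five trials
```

In this toy data, three of four tasks count toward Pass@5 but only one counts toward Pass^5, which is the kind of gap between occasional and consistent success that the abstract reports.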
