SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
- URL: http://arxiv.org/abs/2511.17559v1
- Date: Thu, 13 Nov 2025 06:35:29 GMT
- Title: SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
- Authors: Gyubok Lee, Woosog Chay, Edward Choi
- Abstract summary: We introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries.
- Score: 18.161591137171623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries-whether caused by model errors or problematic user inputs-can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches-from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.
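The abstract describes a post-hoc safety layer that sits between SQL generation and execution: classify the question as answerable, ambiguous, or unanswerable, then verify the candidate SQL before it runs. A minimal sketch of that interface is below; the function name, the toy answerability heuristics, and the table schema are illustrative assumptions, not the SCARE benchmark's actual protocol.

```python
import sqlite3

# Hypothetical post-hoc safety layer; heuristics and schema are illustrative,
# not the SCARE benchmark's actual task format.
ANSWERABLE, AMBIGUOUS, UNANSWERABLE = "answerable", "ambiguous", "unanswerable"

def verify_candidate(conn, question: str, candidate_sql: str):
    """Return (label, sql_or_none): classify the question's answerability,
    then validate the candidate SQL against the live schema before it is
    ever executed."""
    # 1) Toy answerability heuristics: subjective questions are unanswerable
    #    from structured data; vague time references are ambiguous.
    if "opinion" in question.lower():
        return UNANSWERABLE, None
    if "recently" in question.lower():
        return AMBIGUOUS, None
    # 2) SQL verification: EXPLAIN compiles the query without executing it,
    #    surfacing syntax errors and references to missing tables/columns.
    try:
        conn.execute(f"EXPLAIN {candidate_sql}")
    except sqlite3.Error:
        return ANSWERABLE, None  # answerable question, but the SQL is invalid
    return ANSWERABLE, candidate_sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (subject_id INTEGER, admittime TEXT)")
print(verify_candidate(conn, "How many admissions are there?",
                       "SELECT COUNT(*) FROM admissions"))
# → ('answerable', 'SELECT COUNT(*) FROM admissions')
```

A real verifier would replace step (1) with a learned classifier and step (2) with semantic (not just syntactic) checking, but the two-stage shape of the interface is the same.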
Related papers
- Beyond Caption-Based Queries for Video Moment Retrieval [60.31221310786333]
We investigate the degradation of VMR methods when trained on caption-based queries but evaluated on search queries. We introduce three benchmarks by modifying the textual queries in three public VMR datasets. Our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries.
arXiv Detail & Related papers (2026-03-02T20:06:41Z) - Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL [63.578576078216976]
CLIN is a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement. Despite recent advances, performance remains far from clinical reliability.
arXiv Detail & Related papers (2026-01-14T21:12:06Z) - From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents [15.31222936637621]
We introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents. We show that while agents achieve a high Pass@5 of 90-95% (at least one of five trials succeeds) on IncreQA and 60-80% on AdaptQA, their Pass^5 (all five trials succeed) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain.
arXiv Detail & Related papers (2025-09-27T17:13:51Z) - DAC: Decomposed Automation Correction for Text-to-SQL [51.48239006107272]
We introduce Decomposed Automation Correction (DAC), which corrects text-to-SQL output by decomposing the task into entity linking and skeleton parsing.
We show that our method improves performance by 3.7% on average across Spider, BIRD, and KaggleDBQA compared with the baseline method.
arXiv Detail & Related papers (2024-08-16T14:43:15Z) - KU-DMIS at EHRSQL 2024:Generating SQL query via question templatization in EHR [17.998140363824174]
We introduce a novel text-to-SQL framework that robustly handles out-of-domain questions and verifies the generated queries with query execution.
We use a powerful large language model (LLM), fine-tuned GPT-3.5 with detailed prompts involving the table schemas of the EHR database system.
arXiv Detail & Related papers (2024-05-22T02:15:57Z) - LG AI Research & KAIST at EHRSQL 2024: Self-Training Large Language Models with Pseudo-Labeled Unanswerable Questions for a Reliable Text-to-SQL System on EHRs [58.59113843970975]
Text-to-SQL models are pivotal for making Electronic Health Records accessible to healthcare professionals without SQL knowledge.
We present a self-training strategy using pseudo-labeled unanswerable questions to enhance the reliability of text-to-SQL models for EHRs.
arXiv Detail & Related papers (2024-05-18T03:25:44Z) - Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering [48.43453390717167]
We present and tackle the problem of Embodied Question Answering with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, situational queries require the agent to correctly identify multiple object states and reach a consensus on them for an answer. We introduce a novel Prompt-Generate-Evaluate scheme that wraps around an LLM's output to generate unique situational queries and corresponding consensus object information.
arXiv Detail & Related papers (2024-05-08T00:45:20Z) - ProbGate at EHRSQL 2024: Enhancing SQL Query Generation Accuracy through Probabilistic Threshold Filtering and Error Handling [0.0]
We introduce an entropy-based method to identify and filter out unanswerable results.
We experimentally verify that our method can filter out unanswerable questions and can be applied broadly.
arXiv Detail & Related papers (2024-04-25T14:55:07Z) - Wav2SQL: Direct Generalizable Speech-To-SQL Parsing [55.10009651476589]
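The ProbGate entry above describes threshold filtering over the model's generation probabilities: queries decoded with low confidence are withheld as unanswerable. A minimal sketch of that idea follows; the scoring rule (average negative log-likelihood per token) and the threshold value are assumptions for illustration, not the paper's exact method.

```python
import math

# Illustrative probability-threshold filter in the spirit of ProbGate;
# the scoring rule and threshold here are assumptions, not the paper's
# exact formulation.
def accept_query(token_probs, max_avg_nll=2.0):
    """Accept a generated SQL query only if the model's average negative
    log-likelihood per token is below the threshold; otherwise treat the
    question as unanswerable and withhold the query."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return avg_nll < max_avg_nll

print(accept_query([0.90, 0.95, 0.99]))  # confident decoding → True
print(accept_query([0.10, 0.20, 0.05]))  # uncertain decoding → False
```

In practice the threshold would be tuned on a held-out set to trade off answer coverage against the risk of executing a wrong query.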
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given databases.
We propose the first direct speech-to-SQL parsing model, Wav2SQL, which avoids error compounding across cascaded systems.
Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results, with an accuracy improvement of up to 2.5% over the baseline.
arXiv Detail & Related papers (2023-05-21T19:26:46Z) - EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records [36.213730355895805]
The utterances were collected from 222 hospital staff members, including physicians, nurses, and insurance review and health records teams.
We manually linked these questions to two open-source EHR databases, MIMIC-III and eICU, and included various time expressions and held-out unanswerable questions in the dataset.
arXiv Detail & Related papers (2023-01-16T05:10:20Z) - RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging.
Our results show that RoMQA is a challenging benchmark for large language models and provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.