Related papers: Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

URL: http://arxiv.org/abs/2601.09876v1
Date: Wed, 14 Jan 2026 21:12:06 GMT
Title: Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
Authors: Yifei Shen, Yilun Zhao, Justice Ou, Tinglin Huang, Arman Cohan
Abstract summary: CLINSQL is a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement. Despite recent advances, performance remains far from clinical reliability.
Score: 63.578576078216976
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2% and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.

Related papers

CORE-T: COherent REtrieval of Tables for Text-to-SQL [91.76918495375384]
CORE-T is a scalable, training-free framework that enriches tables with purpose metadata and pre-computes a lightweight table-compatibility cache. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables.
arXiv Detail & Related papers (2026-01-19T14:51:23Z)
Query Carefully: Detecting the Unanswerables in Text-to-SQL Tasks [1.7781743265224403]
Text-to-SQL systems allow non-SQL experts to interact with databases using natural language. Their tendency to generate executable SQL for ambiguous, out-of-scope, or unanswerable queries introduces a hidden risk, as outputs may be misinterpreted as correct. We present Query Carefully, a pipeline that integrates SQL generation with explicit ambiguity detection and handling of unanswerable inputs.
arXiv Detail & Related papers (2025-12-19T12:22:27Z)
SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering [18.161591137171623]
We introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries.
arXiv Detail & Related papers (2025-11-13T06:35:29Z)
Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints [11.502074619844125]
CELEC is a large language model (LLM)-powered framework for automated EHR data extraction and analytics. On a subset of the EHRSQL benchmark, CELEC execution accuracy achieves while maintaining low latency, cost efficiency, and strict privacy.
arXiv Detail & Related papers (2025-11-02T02:45:54Z)
RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases [20.708207067646033]
We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, achieves 62.6%.
arXiv Detail & Related papers (2025-05-23T17:58:07Z)
ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback [49.21833666405111]
Large language models (LLMs) excel in many reasoning tasks, but their ability to leverage Chain-of-Thought (CoT) reasoning remains underexplored. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO.
arXiv Detail & Related papers (2025-03-25T18:17:36Z)
OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment [6.2089733671434875]
We propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. These methods have significantly improved the performance of LLMs in the Text-to-SQL task. Experimental results show that OpenSearch-SQL achieves an execution accuracy (EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based efficiency score (R-VES) of 69.3, with all three metrics ranking first at the time of submission.
arXiv Detail & Related papers (2025-02-19T07:51:50Z)
RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction. Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z)
LG AI Research & KAIST at EHRSQL 2024: Self-Training Large Language Models with Pseudo-Labeled Unanswerable Questions for a Reliable Text-to-SQL System on EHRs [58.59113843970975]
Text-to-SQL models are pivotal for making Electronic Health Records accessible to healthcare professionals without SQL knowledge. We present a self-training strategy using pseudo-labeled unanswerable questions to enhance the reliability of text-to-SQL models for EHRs.
arXiv Detail & Related papers (2024-05-18T03:25:44Z)
COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching [70.08786840301435]
We propose CrOss-Modal PseudO-SiamEse network (COMPOSE) to address these challenges for patient-trial matching. Experiment results show COMPOSE can reach 98.0% AUC on patient-criteria matching and 83.7% accuracy on patient-trial matching.
arXiv Detail & Related papers (2020-06-15T21:01:33Z)
