Related papers: Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

URL: http://arxiv.org/abs/2602.12015v1
Date: Thu, 12 Feb 2026 14:46:20 GMT
Title: Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
Authors: Angelo Ziletti, Leonardo D'Ambrosi
Abstract summary: We propose CLUES, a framework that models Text-to-SQL as a two-stage process. It decomposes semantic uncertainty into an ambiguity score and an instability score. CLUES improves failure prediction over state-of-the-art Kernel Language Entropy.
Score: 0.3437656066916039
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
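For readers unfamiliar with the Schur complement mentioned in the abstract, the following is a minimal sketch of the operation itself. It is not the CLUES implementation: the block layout (interpretation nodes first, answer nodes second), the edge weights, the diagonal blocks, and the function name `schur_complement` are all assumptions made for illustration, and how the paper turns the result into an instability score is not stated here.

```python
import numpy as np

def schur_complement(M: np.ndarray, k: int) -> np.ndarray:
    """Schur complement of the top-left k x k block A in M = [[A, B], [C, D]].

    Returns D - C @ inv(A) @ B, computed with a linear solve instead of an
    explicit inverse for numerical stability.
    """
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    return D - C @ np.linalg.solve(A, B)

# Hypothetical bipartite weights: 2 interpretations (rows) x 3 answers (columns).
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])

# Assumed block matrix: diagonal blocks keep M positive definite,
# off-diagonal blocks hold the interpretation-to-answer edges.
A = 2.0 * np.eye(2)
D = 2.0 * np.eye(3)
M = np.block([[A, W],
              [W.T, D]])

# Eliminating the interpretation block leaves a 3 x 3 matrix over answers only.
S = schur_complement(M, k=2)
print(S)
```

A standard property worth checking on any such sketch is that det(M) = det(A) * det(S), which holds whenever A is invertible.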

Related papers

Evaluating Robustness of Reasoning Models on Parameterized Logical Problems [20.78623024814435]
Logic provides a controlled testbed for evaluating LLM-based reasoners. Standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas.
arXiv Detail & Related papers (2026-02-13T06:54:25Z)
LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries [6.5781226398371615]
Unanswerable and underspecified user queries pose a major barrier to safe deployment in text-to-SQL systems. LatentRefusal is a latent-signal refusal mechanism that predicts answerability from hidden activations of a large language model. We show that LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.
arXiv Detail & Related papers (2026-01-15T13:48:22Z)
Node-Level Uncertainty Estimation in LLM-Generated SQL [13.436696325103147]
We introduce a semantically aware labeling algorithm that assigns node-level correctness without over-penalizing structural containers or alias variation. We represent each node with a rich set of schema-aware and lexical features, capturing identifier validity, alias resolution, type compatibility, ambiguity in scope, and typo signals. We interpret these probabilities as uncertainty, enabling fine-grained diagnostics that pinpoint exactly where a query is likely to be wrong.
arXiv Detail & Related papers (2025-11-17T23:31:45Z)
SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering [18.161591137171623]
We introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries.
arXiv Detail & Related papers (2025-11-13T06:35:29Z)
SConU: Selective Conformal Uncertainty in Large Language Models [59.25881667640868]
We propose a novel approach termed Selective Conformal Uncertainty (SConU). We develop two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions.
arXiv Detail & Related papers (2025-04-19T03:01:45Z)
Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond [52.246494389096654]
This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels. We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular large language models (LLMs).
arXiv Detail & Related papers (2024-02-22T03:46:08Z)
Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order. We propose Forced Invalidation to help preserve the importance of word order. Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness [115.66421993459663]
Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. We propose a comprehensive robustness benchmark based on Spider to diagnose the model. We conduct a diagnostic study of the state-of-the-art models on the set.
arXiv Detail & Related papers (2023-01-21T03:57:18Z)
SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in the neural network based approaches (called SUN). Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition [59.52434325897716]
We propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives. For the former, an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space. For the latter, the pairwise relationships of semantic features between instances are fully exploited to estimate the ambiguity extent in the instance space.
arXiv Detail & Related papers (2021-04-01T03:21:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.