Confidence Estimation for Error Detection in Text-to-SQL Systems
- URL: http://arxiv.org/abs/2501.09527v1
- Date: Thu, 16 Jan 2025 13:23:07 GMT
- Title: Confidence Estimation for Error Detection in Text-to-SQL Systems
- Authors: Oleg Somov, Elena Tutubalina
- Abstract summary: This study investigates the integration of selective classifiers into Text-to-SQL systems.
We show that encoder-decoder T5 is better calibrated than in-context-learning GPT-4 and decoder-only Llama 3.
In terms of error detection, the selective classifier is more likely to detect errors associated with irrelevant questions than those caused by incorrect query generation.
- Score: 5.636160825241556
- License:
- Abstract: Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT-4 and decoder-only Llama 3, and thus its designated external entropy-based selective classifier performs better. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors associated with irrelevant questions than those caused by incorrect query generation.
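The entropy-based selective classification described in the abstract can be sketched roughly as follows. This is a minimal illustration of the general idea, not the paper's exact formulation: the per-step entropy aggregation (here, a simple mean) and the abstention threshold are illustrative assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_predict(generated_sql, step_distributions, threshold):
    """Accept the generated SQL only when the model looks confident.

    step_distributions: one probability distribution per decoding step.
    Accepts when the mean per-step entropy is at or below the threshold;
    otherwise abstains (returns None), e.g. to defer to a human.
    """
    mean_entropy = (sum(token_entropy(d) for d in step_distributions)
                    / len(step_distributions))
    return generated_sql if mean_entropy <= threshold else None
```

Raising the threshold increases coverage (fewer abstentions) at the cost of higher risk (more accepted errors), which is exactly the coverage-risk trade-off the paper analyses.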
Related papers
- Reliable Text-to-SQL with Adaptive Abstention [21.07332675929629]
We present a novel framework that enhances query generation reliability by incorporating abstention and human-in-the-loop mechanisms.
We validate our approach through comprehensive experiments on the BIRD benchmark, demonstrating significant improvements in robustness and reliability.
arXiv Detail & Related papers (2025-01-18T19:36:37Z) - Text-to-SQL Calibration: No Need to Ask -- Just Rescale Model Probabilities [20.606333546028516]
We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods.
Our comprehensive evaluation, conducted across two widely-used Text-to-SQL benchmarks and multiple architectures, provides valuable insights into the effectiveness of various calibration strategies.
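The full-sequence-probability baseline this paper advocates is simple to state: the confidence score for a generated query is just the product of its token probabilities, i.e. the exponentiated sum of token log-probabilities. A minimal sketch, assuming the decoder exposes per-token log-probabilities:

```python
import math

def sequence_confidence(token_logprobs):
    """Confidence of a generated sequence as its full-sequence probability:
    exp(sum of per-token log-probabilities)."""
    return math.exp(sum(token_logprobs))
```

This raw score can then be rescaled (e.g. by temperature) so that it tracks empirical accuracy, which is the "just rescale model probabilities" idea in the title.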
arXiv Detail & Related papers (2024-11-23T19:20:24Z) - Epistemic Integrity in Large Language Models [11.173637560124828]
Large language models are increasingly relied upon as sources of information, but their propensity for false or misleading statements poses high risks for users and society.
In this paper, we confront the critical problem of miscalibration where a model's linguistic assertiveness fails to reflect its true internal certainty.
We introduce a new human misalignment evaluation and a novel method for measuring the linguistic assertiveness of Large Language Models.
arXiv Detail & Related papers (2024-11-10T17:10:13Z) - Context-Aware SQL Error Correction Using Few-Shot Learning -- A Novel Approach Based on NLQ, Error, and SQL Similarity [0.0]
This paper introduces a novel few-shot learning-based approach for error correction in SQL generation.
It enhances the accuracy of generated queries by selecting the most suitable few-shot error correction examples for a given natural language question (NLQ).
In experiments with the open-source dataset, the proposed model achieves a 39.2% improvement in error fixing over no error correction and a 10% improvement over a simple error-correction method.
arXiv Detail & Related papers (2024-10-11T18:22:08Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Transformer-based approaches to Sentiment Detection [55.41644538483948]
We examined the performance of four different types of state-of-the-art transformer models for text classification.
The RoBERTa transformer model performs best on the test dataset with a score of 82.6% and is highly recommended for quality predictions.
arXiv Detail & Related papers (2023-03-13T17:12:03Z) - Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness [115.66421993459663]
Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations.
We propose a comprehensive robustness benchmark based on Spider to diagnose model robustness.
We conduct a diagnostic study of the state-of-the-art models on the set.
arXiv Detail & Related papers (2023-01-21T03:57:18Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.