Hallucination Detection for LLM-based Text-to-SQL Generation via Two-Stage Metamorphic Testing
- URL: http://arxiv.org/abs/2512.22250v1
- Date: Wed, 24 Dec 2025 04:04:26 GMT
- Title: Hallucination Detection for LLM-based Text-to-SQL Generation via Two-Stage Metamorphic Testing
- Authors: Bo Yang, Yinfen Xia, Weisong Sun, Yang Liu,
- Abstract summary: Large language models (LLMs) generate hallucinations, i.e.,unrealistic or illogical content.<n>We propose a novel hallucination detection method based on metamorphic testing (MT) that does not require standard answers.<n>Trials demonstrate our method's superior performance in terms of the F1-score, which ranges from 69.36% to 82.76%.
- Score: 8.942002314582789
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Text-to-SQL generation, large language models (LLMs) have shown strong generalization and adaptability. However, LLMs sometimes generate hallucinations, i.e.,unrealistic or illogical content, which leads to incorrect SQL queries and negatively impacts downstream applications. Detecting these hallucinations is particularly challenging. Existing Text-to-SQL error detection methods, which are tailored for traditional deep learning models, face significant limitations when applied to LLMs. This is primarily due to the scarcity of ground-truth data. To address this challenge, we propose SQLHD, a novel hallucination detection method based on metamorphic testing (MT) that does not require standard answers. SQLHD splits the detection task into two sequentiial stages: schema-linking hallucination detection via eight structure-aware Metamorphic Relations (MRs) that perturb comparative words, entities, sentence structure or database schema, and logical-synthesis hallucination detection via nine logic-aware MRs that mutate prefix words, extremum expressions, comparison ranges or the entire database. In each stage the LLM is invoked separately to generate schema mappings or SQL artefacts; the follow-up outputs are cross-checked against their source counterparts through the corresponding MRs, and any violation is flagged as a hallucination without requiring ground-truth SQL. The experimental results demonstrate our method's superior performance in terms of the F1-score, which ranges from 69.36\% to 82.76\%. Additionally, SQLHD demonstrates superior performance over LLM Self-Evaluation methods, effectively identifying hallucinations in Text-to-SQL tasks.
Related papers
- ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement [57.98138819417949]
We propose ErrorLLM, a framework that explicitly models text-to- querying.<n>We show that ErrorLLM achieves the most significant improvements over backbone initial generation.<n>ErrorLLM addresses both sides by high detection F1 score while maintaining refinement effectiveness.
arXiv Detail & Related papers (2026-03-04T05:27:20Z) - Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts [21.081815261690444]
Large language models (LLMs) often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge.<n>This poses serious risks in domains like healthcare, finance, and customer support.<n>We introduce CONFACTCHECK, an efficient detection approach that does not leverage any external knowledge base.
arXiv Detail & Related papers (2025-11-15T14:33:02Z) - CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description [15.080310729603466]
CRED- is a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description.<n>It bridges the gap between natural language questions (NLQs) and their correspondingsql queries.<n>CRED- achieves new state-of-git-the-art (SOTA) performance, validating its effectiveness and scalability.
arXiv Detail & Related papers (2025-08-18T09:43:07Z) - MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
Multimodal hallucinations are multi-sourced and arise from diverse causes.<n>Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations.
arXiv Detail & Related papers (2025-05-30T05:54:36Z) - RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL- that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on GPT-4ocorrection.
Our approach outperforms a series of GPT-4 based Text-to-Seek systems when adopting DeepSeek (much cheaper) with same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL [54.304872649870575]
Large Language Models (LLMs) have emerged as powerful tools for Text-to-sense tasks.
In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type.
arXiv Detail & Related papers (2024-09-21T09:33:14Z) - DAC: Decomposed Automation Correction for Text-to-SQL [51.48239006107272]
We introduce De Automation Correction (DAC), which corrects text-to-composed by decomposing entity linking and skeleton parsing.
We show that our method improves performance by $3.7%$ on average of Spider, Bird, and KaggleDBQA compared with the baseline method.
arXiv Detail & Related papers (2024-08-16T14:43:15Z) - Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation [21.58204328067628]
Large Language Models (LLMs) driven by In-Context Learning (ICL) have significantly improved the performance of text-to-.
Previous methods generally employ a two-stage reasoning framework, namely 1) linking schema and 2) logical synthesis, making the framework not only effective but also interpretable.
Despite these advancements, the inherent bad nature of the generalization of LLMs often results in hallucinations, which limits the full potential of LLMs.
In this work, we first identify and categorize the common types of hallucinations at each stage in text-to-.
We then introduce a novel strategy, Task Alignment (TA),
arXiv Detail & Related papers (2024-05-24T07:51:08Z) - PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency [19.067737007347613]
Methods achieve new SOTA results on the Spider benchmark, with an execution accuracy of 87.6%.
Our methods achieve new SOTA results on the Spider benchmark, with an execution accuracy of 87.6%.
arXiv Detail & Related papers (2024-03-13T02:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.