Automated Discovery of Test Oracles for Database Management Systems Using LLMs
- URL: http://arxiv.org/abs/2510.06663v1
- Date: Wed, 08 Oct 2025 05:29:11 GMT
- Title: Automated Discovery of Test Oracles for Database Management Systems Using LLMs
- Authors: Qiuyang Mang, Runyuan He, Suyang Zhong, Xiaoxuan Liu, Huanchen Zhang, Alvin Cheung,
- Abstract summary: This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles. LLMs are prone to hallucinations that can produce numerous false positive bug reports. We introduce Argus, a novel framework built upon the core concept of the Constrained Abstract Query.
- Score: 13.143749352093474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely-used systems. A cornerstone of these techniques is the test oracle, which typically implements a mechanism to generate equivalent query pairs, thereby identifying bugs by checking the consistency between their results. However, while applying these oracles can be automated, their design remains a fundamentally manual endeavor. This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles, addressing a long-standing bottleneck towards fully automated DBMS testing. Although LLMs demonstrate impressive creativity, they are prone to hallucinations that can produce numerous false positive bug reports. Furthermore, their significant monetary cost and latency mean that LLM invocations should be limited to ensure that bug detection is efficient and economical. To this end, we introduce Argus, a novel framework built upon the core concept of the Constrained Abstract Query - a SQL skeleton containing placeholders and their associated instantiation conditions (e.g., requiring a placeholder to be filled by a boolean column). Argus uses LLMs to generate pairs of these skeletons that are asserted to be semantically equivalent. This equivalence is then formally proven using a SQL equivalence solver to ensure soundness. Finally, the placeholders within the verified skeletons are instantiated with concrete, reusable SQL snippets that are also synthesized by LLMs to efficiently produce complex test cases. We implemented Argus and evaluated it on five extensively tested DBMSs, discovering 40 previously unknown bugs, 35 of which are logic bugs, with 36 confirmed and 26 already fixed by the developers.
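The Constrained Abstract Query idea described in the abstract can be sketched in a few lines. The following is an illustrative, simplified model (not the Argus implementation): the class names, the `<p>` placeholder syntax, and the example skeleton pair are all hypothetical, and the equivalence shown (`p` vs. `NOT (NOT p)` over a boolean column) is a deliberately trivial stand-in for the skeleton pairs that Argus would verify with a SQL equivalence solver.

```python
# Illustrative sketch of a Constrained Abstract Query (CAQ): a SQL skeleton
# with placeholders, each carrying a condition restricting what concrete
# SQL snippet may fill it. Names and syntax here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Placeholder:
    name: str
    condition: str  # e.g. "boolean column" -- the instantiation constraint


@dataclass
class ConstrainedAbstractQuery:
    skeleton: str  # SQL text containing <name> placeholders
    placeholders: list[Placeholder] = field(default_factory=list)

    def instantiate(self, bindings: dict[str, str]) -> str:
        """Fill every placeholder with a concrete SQL snippet."""
        sql = self.skeleton
        for p in self.placeholders:
            sql = sql.replace(f"<{p.name}>", bindings[p.name])
        return sql


# A pair of skeletons asserted to be semantically equivalent whenever <p>
# is instantiated with a boolean column (a trivial example; real pairs
# would be proven equivalent by a SQL equivalence solver).
lhs = ConstrainedAbstractQuery(
    "SELECT * FROM t WHERE <p>",
    [Placeholder("p", "boolean column")],
)
rhs = ConstrainedAbstractQuery(
    "SELECT * FROM t WHERE NOT (NOT <p>)",
    [Placeholder("p", "boolean column")],
)

bindings = {"p": "t.is_active"}
print(lhs.instantiate(bindings))  # SELECT * FROM t WHERE t.is_active
print(rhs.instantiate(bindings))  # SELECT * FROM t WHERE NOT (NOT t.is_active)
```

Running both instantiated queries against the same DBMS and comparing their result sets is what turns a verified skeleton pair into a test oracle: any discrepancy indicates a logic bug.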
Related papers
- ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement [57.98138819417949]
We propose ErrorLLM, a framework that explicitly models errors in text-to-SQL querying. We show that ErrorLLM achieves the most significant improvements over the backbone models' initial generation. ErrorLLM addresses both sides, achieving a high error-detection F1 score while maintaining refinement effectiveness.
arXiv Detail & Related papers (2026-03-04T05:27:20Z) - FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven Fuzzing [37.235342117305684]
FuzzySQL unifies rule-based patching with semantic repair to correct syntactic and context-sensitive failures. We uncover 37 vulnerabilities, 7 of which are tied to under-tested special features. Our results highlight the limitations of conventional fuzzers in semantic feature coverage.
arXiv Detail & Related papers (2026-02-23T04:20:19Z) - Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
AOE is an Arranged and Organized Extraction benchmark designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggle significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation [2.794277194464204]
Unit testing plays a critical role in ensuring software correctness. Traditional methods rely on search-based or randomized algorithms to achieve high code coverage. We propose CANDOR, a novel prompt-engineering-based LLM framework for automated unit test generation in Java.
arXiv Detail & Related papers (2025-06-03T14:43:05Z) - Testing Database Systems with Large Language Model Synthesized Fragments [3.3302293148249125]
We propose ShQveL, an approach that augments existing SQL test-case generators by leveraging Large Language Models (LLMs). We evaluated ShQveL on 5 iterations and discovered 55 unique and previously unknown bugs, 50 of which were promptly fixed after our reports.
arXiv Detail & Related papers (2025-05-04T06:48:01Z) - Scaling Automated Database System Testing [3.3302293148249125]
In this work, we present both a vision and a platform, SQLancer++, to apply test oracles to any database that supports a subset of common SQL features.
arXiv Detail & Related papers (2025-03-27T12:10:36Z) - Can the Rookies Cut the Tough Cookie? Exploring the Use of LLMs for SQL Equivalence Checking [15.42143912008553]
We introduce a novel, realistic, and sufficiently complex benchmark called SQLEquiQuest for query equivalence checking. We evaluate several state-of-the-art LLMs using various prompting strategies and carefully constructed in-context learning examples. Our analysis shows that LLMs exhibit a strong bias for equivalence predictions, with consistently poor performance over non-equivalent pairs.
arXiv Detail & Related papers (2024-12-07T06:50:12Z) - PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL [54.304872649870575]
Large Language Models (LLMs) have emerged as powerful tools for Text-to-SQL tasks.
In this study, we propose that employing query group partitioning allows LLMs to focus on learning the thought processes specific to a single problem type.
arXiv Detail & Related papers (2024-09-21T09:33:14Z) - Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z) - ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performances in various applications, yet evaluating them is still challenging.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks.
We propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.