Finding Cross-rule Optimization Bugs in Datalog Engines
- URL: http://arxiv.org/abs/2402.12863v1
- Date: Tue, 20 Feb 2024 09:54:52 GMT
- Title: Finding Cross-rule Optimization Bugs in Datalog Engines
- Authors: Chi Zhang, Linzhang Wang, Manuel Rigger
- Abstract summary: We propose an automated testing approach called Incremental Rule Evaluation (IRE).
IRE tackles the test oracle and test case generation problem.
We implement IRE as a tool named Deopt and evaluate it on four Datalog engines.
- Score: 8.849383195527627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Datalog is a popular and widely-used declarative logic programming language.
Datalog engines apply many cross-rule optimizations; bugs in them can cause
incorrect results. To detect such optimization bugs, we propose an automated
testing approach called Incremental Rule Evaluation (IRE), which
synergistically tackles the test oracle and test case generation problem. The
core idea behind the test oracle is to compare the results of an optimized
program and a program without cross-rule optimization; any difference indicates
a bug in the Datalog engine. Our core insight is that, for an optimized,
incrementally generated Datalog program, we can evaluate all rules individually
by constructing a reference program that disables the optimizations performed
across multiple rules. Incrementally generating test cases not only allows us
to apply the test oracle to every newly generated rule; it also lets us ensure,
with a given probability, that every newly added rule produces a non-empty
result, and it avoids recomputing already-known facts. We implemented IRE as a
tool named Deopt and evaluated it on four mature Datalog engines, namely
Soufflé, CozoDB, μZ, and DDlog, discovering a total of 30 bugs. Of these, 13
were logic bugs, while the remaining 17 were crash and error bugs. Deopt can
detect all bugs found by queryFuzz, a state-of-the-art approach, whereas
queryFuzz might be unable to detect 5 of the bugs identified by Deopt. Our
incremental test case generation approach is efficient; for example, for test
cases containing 60 rules, it produces 1.17× (for DDlog) to 31.02× (for
Soufflé) as many valid test cases with non-empty results as the naive random
method. We believe that the simplicity and generality of the approach will
lead to its wide adoption in practice.
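To make the test oracle concrete, below is a minimal sketch in Soufflé-style Datalog. The relation names (edge, path, reach2) and the exact way the reference program is built are illustrative assumptions, not Deopt's concrete construction. The optimized program hands all rules to the engine at once, so optimizations that span rules (e.g., inlining one relation's rules into another) can apply; the reference program evaluates the newest rule in isolation against already-materialized facts, so no optimization can cross rule boundaries.

```
// Optimized program (illustrative): all rules are evaluated together, so the
// engine may apply cross-rule optimizations such as inlining path into reach2.
.decl edge(x:number, y:number)
.decl path(x:number, y:number)
.decl reach2(x:number, y:number)
.input edge
path(x, y)   :- edge(x, y).
path(x, z)   :- path(x, y), edge(y, z).
reach2(x, z) :- path(x, y), path(y, z).
.output reach2

// Reference program (illustrative): path is supplied as ground facts that were
// materialized when its rules were evaluated in an earlier incremental step,
// so the newly added reach2 rule is evaluated without cross-rule optimization.
.decl path(x:number, y:number)
.decl reach2(x:number, y:number)
.input path
reach2(x, z) :- path(x, y), path(y, z).
.output reach2
```

Comparing the reach2 tuples produced by the two programs instantiates the oracle: any mismatch points to a cross-rule optimization bug in the engine under test.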
Related papers
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- Constant Optimization Driven Database System Testing [6.246028398098516]
Logic bugs are bugs that can cause database management systems (DBMSs) to silently produce incorrect results for given queries.
We propose Constant-Optimization-Driven Database Testing (CODDTest) as a novel approach for detecting logic bugs in databases.
arXiv Detail & Related papers (2025-01-20T03:32:55Z)
- Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two principled algorithms for the test-time compute of large language models.
We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
arXiv Detail & Related papers (2024-11-29T05:29:47Z)
- Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
- B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests [16.19318541132026]
We show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests.
We propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge.
arXiv Detail & Related papers (2024-09-13T10:22:08Z)
- Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency [15.593172556501704]
We introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime as learning signals.
Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting.
As a by-product, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval.
arXiv Detail & Related papers (2024-06-18T11:05:37Z)
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z)
- FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
- Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information [78.78486761923855]
In many real world problems, we want to infer some property of an expensive black-box function f, given a budget of T function evaluations.
We present a procedure, InfoBAX, that sequentially chooses queries that maximize mutual information with respect to the algorithm's output.
On these problems, InfoBAX uses up to 500 times fewer queries to f than required by the original algorithm.
arXiv Detail & Related papers (2021-04-19T17:22:11Z)
- Revisiting Bayesian Optimization in the light of the COCO benchmark [1.4467794332678539]
This article reports a large investigation of how common and less common design choices affect the performance of (Gaussian process based) BO.
The code developed for this study makes the new version (v2.1.1) of the R package DiceOptim available on CRAN.
arXiv Detail & Related papers (2021-03-30T19:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.