Finding Cross-rule Optimization Bugs in Datalog Engines
- URL: http://arxiv.org/abs/2402.12863v1
- Date: Tue, 20 Feb 2024 09:54:52 GMT
- Title: Finding Cross-rule Optimization Bugs in Datalog Engines
- Authors: Chi Zhang, Linzhang Wang, Manuel Rigger
- Abstract summary: We propose an automated testing approach called Incremental Rule Evaluation (IRE).
IRE tackles the test oracle and test case generation problem.
We implement IRE as a tool named Deopt and evaluate it on four Datalog engines.
- Score: 8.849383195527627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Datalog is a popular and widely-used declarative logic programming language.
Datalog engines apply many cross-rule optimizations; bugs in them can cause
incorrect results. To detect such optimization bugs, we propose an automated
testing approach called Incremental Rule Evaluation (IRE), which
synergistically tackles the test oracle and test case generation problem. The
core idea behind the test oracle is to compare the results of an optimized
program and a program without cross-rule optimization; any difference indicates
a bug in the Datalog engine. Our core insight is that, for an optimized,
incrementally generated Datalog program, we can evaluate all rules individually
by constructing a reference program that disables the optimizations performed
across multiple rules. Incrementally generating test cases not only allows us
to apply the test oracle to every newly generated rule; it also lets us ensure,
with a given probability, that every newly added rule produces a non-empty
result, and it avoids recomputing already-known facts. We implemented IRE as a
tool named Deopt and evaluated it on four mature Datalog engines, namely
Soufflé, CozoDB, μZ, and DDlog, discovering a total of 30 bugs. Of these, 13
were logic bugs, while the remaining 17 were crash and error bugs. Deopt can
detect all bugs found by queryFuzz, a state-of-the-art approach, whereas
queryFuzz might be unable to detect 5 of the bugs identified by Deopt. Our
incremental test case generation approach is efficient; for example, for test
cases containing 60 rules, it produces 1.17× (for DDlog) to 31.02× (for
Soufflé) as many valid test cases with non-empty results as the naive random
method. We believe that the simplicity and generality of the approach will
lead to its wide adoption in practice.
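To make the test oracle concrete, below is a minimal sketch in Soufflé-style Datalog. The relation names (edge, path, reach2) and the exact way the reference program is built are illustrative assumptions, not Deopt's concrete construction. The optimized program hands all rules to the engine at once, so optimizations that span rules (e.g., inlining one relation's rules into another) can apply; the reference program evaluates the newest rule in isolation against already-materialized facts, so no optimization can cross rule boundaries.

```
// Optimized program (illustrative): all rules are evaluated together, so the
// engine may apply cross-rule optimizations such as inlining path into reach2.
.decl edge(x:number, y:number)
.decl path(x:number, y:number)
.decl reach2(x:number, y:number)
.input edge
path(x, y)   :- edge(x, y).
path(x, z)   :- path(x, y), edge(y, z).
reach2(x, z) :- path(x, y), path(y, z).
.output reach2

// Reference program (illustrative): path is supplied as ground facts that were
// materialized when its rules were evaluated in an earlier incremental step,
// so the newly added reach2 rule is evaluated without cross-rule optimization.
.decl path(x:number, y:number)
.decl reach2(x:number, y:number)
.input path
reach2(x, z) :- path(x, y), path(y, z).
.output reach2
```

Comparing the reach2 tuples produced by the two programs instantiates the oracle: any mismatch points to a cross-rule optimization bug in the engine under test.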
Related papers
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- Constant Optimization Driven Database System Testing [6.246028398098516]
Logic bugs are bugs that can cause database management systems (DBMSs) to silently produce incorrect results for given queries.
We propose Constant-Optimization-Driven Database Testing (CODDTest) as a novel approach for detecting logic bugs in databases.
arXiv Detail & Related papers (2025-01-20T03:32:55Z)
- Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two principled algorithms for the test-time compute of large language models.
We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
arXiv Detail & Related papers (2024-11-29T05:29:47Z)
- Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases.
We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z)
- B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests [16.19318541132026]
We show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests.
We propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge.
arXiv Detail & Related papers (2024-09-13T10:22:08Z)
- Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency [15.593172556501704]
We introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime as learning signals.
Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting.
As a by-product, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval.
arXiv Detail & Related papers (2024-06-18T11:05:37Z)
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z)
- FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
arXiv Detail & Related papers (2023-06-28T13:00:17Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
- Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information [78.78486761923855]
In many real world problems, we want to infer some property of an expensive black-box function f, given a budget of T function evaluations.
We present a procedure, InfoBAX, that sequentially chooses queries that maximize mutual information with respect to the algorithm's output.
On these problems, InfoBAX uses up to 500 times fewer queries to f than required by the original algorithm.
arXiv Detail & Related papers (2021-04-19T17:22:11Z)
- Revisiting Bayesian Optimization in the light of the COCO benchmark [1.4467794332678539]
This article reports a large investigation of how common and less common design choices affect the performance of (Gaussian process based) BO.
The code developed for this study makes the new version (v2.1.1) of the R package DiceOptim available on CRAN.
arXiv Detail & Related papers (2021-03-30T19:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.