Automatic Test-Case Reduction in Proof Assistants: A Case Study in Coq
- URL: http://arxiv.org/abs/2202.13823v2
- Date: Mon, 10 Mar 2025 18:58:36 GMT
- Title: Automatic Test-Case Reduction in Proof Assistants: A Case Study in Coq
- Authors: Jason Gross, Théo Zimmermann, Rajashree Agrawal, Adam Chlipala
- Abstract summary: We present the Coq Bug Minimizer, a tool for reproducing buggy behavior with minimal and standalone files. Our tool is integrated with coqbot to trigger automatically on Coq reverse CI failures.
- Score: 3.355947151753714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the adoption of proof assistants increases, there is a need for efficiency in identifying, documenting, and fixing compatibility issues that arise from proof assistant evolution. We present the Coq Bug Minimizer, a tool for reproducing buggy behavior with minimal and standalone files, integrated with coqbot to trigger automatically on Coq reverse CI failures. Our tool eliminates the overhead of having to download, set up, compile, and then explore and understand large developments, enabling Coq developers to easily obtain modular test-case files for fast experimentation. In this paper, we describe insights about how test-case reduction in Coq differs from test-case reduction in traditional compilers. We expect that our insights will generalize to other proof assistants. We evaluate the Coq Bug Minimizer on over 150 CI failures. Our tool succeeds in reducing failures to smaller test cases roughly 75% of the time. The minimizer produces a fully standalone test case 89% of the time, and the reduced test case is on average about one-third the size of the original. The average reduced test case compiles in 1.25 seconds, with 75% taking under half a second.
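The overall reduction strategy can be pictured as a greedy delta-debugging loop. The sketch below is a generic illustration of that technique, not the Coq Bug Minimizer's actual implementation; `reproduces_bug` stands in for a hypothetical oracle that recompiles a candidate file and checks for the original error:

```python
# Generic greedy test-case reduction (ddmin-style sketch).
# NOT the Coq Bug Minimizer's implementation; `reproduces_bug` is a
# hypothetical oracle that compiles a candidate and checks whether
# the original failure still occurs.
from typing import Callable, List

def reduce_lines(lines: List[str],
                 reproduces_bug: Callable[[List[str]], bool]) -> List[str]:
    chunk = len(lines) // 2 or 1
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]  # drop one chunk
            if candidate and reproduces_bug(candidate):
                lines = candidate          # keep the smaller failing file
            else:
                i += chunk                 # chunk was needed; move on
        chunk //= 2
    return lines
```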
Related papers
- Issue2Test: Generating Reproducing Test Cases from Issue Reports [21.28421180698285]
A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue.
This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report.
We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 30.4% of the issues.
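A reproducing test is usually validated by requiring that it fail on the unfixed code with an error matching the report. A minimal sketch of that check, assuming a pytest-based project (the command and `expected_error` are illustrative):

```python
import subprocess

def is_reproducing(test_file: str, expected_error: str) -> bool:
    """A candidate test reproduces the issue if it FAILS on the
    unfixed code with a message matching the report."""
    result = subprocess.run(
        ["pytest", test_file, "-x"], capture_output=True, text=True
    )
    failed = result.returncode != 0
    return failed and expected_error in (result.stdout + result.stderr)
```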
arXiv Detail & Related papers (2025-03-20T16:44:00Z)
- Robust Conformal Prediction with a Single Binary Certificate [58.450154976190795]
Conformal prediction (CP) converts any model's output into prediction sets that are guaranteed to cover the true label with (adjustable) high probability.
We propose a robust conformal prediction method that produces smaller sets even with significantly fewer Monte Carlo (MC) samples.
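For context, the standard split conformal procedure that such work builds on fits in a few lines; the sketch below shows only that vanilla baseline, not the paper's robust variant:

```python
import numpy as np

def conformal_sets(cal_scores, test_probs, alpha=0.1):
    """Split conformal prediction: cal_scores are nonconformity scores
    (e.g. 1 - p_true) on a held-out calibration set; returns a boolean
    mask of labels kept in each test prediction set."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile of the calibration scores.
    q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    return (1.0 - test_probs) <= q  # keep labels with small score

# Toy usage: coverage >= 1 - alpha holds on exchangeable data.
cal = np.random.rand(500)                          # toy scores
probs = np.random.dirichlet(np.ones(10), size=3)   # toy softmax outputs
print(conformal_sets(cal, probs))
```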
arXiv Detail & Related papers (2025-03-07T08:41:53Z)
- s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
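One minimal way to spend extra test-time compute is majority voting over repeated samples (self-consistency). The sketch below shows that generic baseline, not necessarily this paper's specific method; `sample_answer` is a hypothetical stochastic model call:

```python
from collections import Counter

def scaled_answer(sample_answer, prompt: str, budget: int) -> str:
    """Spend `budget` extra samples of test-time compute and return
    the majority-vote answer. `sample_answer(prompt)` is a
    hypothetical stochastic model call."""
    votes = Counter(sample_answer(prompt) for _ in range(budget))
    return votes.most_common(1)[0][0]
```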
arXiv Detail & Related papers (2025-01-31T18:48:08Z)
- CoqPilot, a plugin for LLM-based generation of proofs [0.0]
CoqPilot is a VS Code extension designed to help automate the writing of Coq proofs.
The plugin collects the parts of proofs marked with the admit tactic in a Coq file.
It combines LLMs along with non-machine-learning methods to generate proof candidates for the holes.
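Locating those holes can be pictured with a rough regex-based sketch; the real plugin works through Coq tooling rather than pattern matching, so this is illustrative only:

```python
import re

# Rough sketch of locating proof holes: CoqPilot targets goals closed
# with the `admit` tactic; the real plugin uses Coq tooling, not a
# regex, so this stand-in is illustrative only.
HOLE = re.compile(r"\badmit\s*\.")

def find_holes(coq_source: str) -> list:
    """Return character offsets of `admit.` occurrences, i.e. the
    holes to hand to an LLM (or a non-ML method) as targets."""
    return [m.start() for m in HOLE.finditer(coq_source)]

print(find_holes("Lemma l : 1 = 1.\nProof. admit. Admitted.\n"))  # -> [24]
```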
arXiv Detail & Related papers (2024-10-25T14:57:29Z)
- CoqPyt: Proof Navigation in Python in the Era of LLMs [5.029445580644576]
This paper presents CoqPyt, a Python tool for interacting with the Coq proof assistant.
CoqPyt improves on other Coq-related tools by providing novel features, such as the extraction of rich premise data.
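As a generic illustration of scripting Coq from Python, one can pipe commands to the coqtop REPL; CoqPyt's actual interface is richer (e.g. premise extraction) and is not shown here:

```python
import subprocess

# Generic illustration of driving Coq from Python by piping commands
# to the coqtop REPL. This is only a minimal stand-in; CoqPyt's real
# programmatic interface is richer and differs from this sketch.
def run_coq(commands: str) -> str:
    proc = subprocess.run(
        ["coqtop"], input=commands, capture_output=True, text=True
    )
    return proc.stdout  # includes Coq's responses, e.g. "nat : Set"

print(run_coq("Check nat.\n"))
```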
arXiv Detail & Related papers (2024-05-07T12:50:28Z)
- Taming Timeout Flakiness: An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes.
Test timeouts are one contributing factor to such flaky test failures.
The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions.
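Measuring such a rate amounts to re-executing a test unchanged and counting disagreements; a minimal sketch (the test command is illustrative):

```python
import subprocess

def flakiness_rate(test_cmd: list, runs: int = 10) -> float:
    """Estimate how flaky a test is by re-executing it unchanged: a
    test that both passes and fails across identical runs is flaky.
    More repetitions expose more flakiness, which is why the reported
    rate grows with the number of repeated executions."""
    failures = sum(
        subprocess.run(test_cmd, capture_output=True).returncode != 0
        for _ in range(runs)
    )
    return failures / runs  # 0.0 or 1.0 => deterministic; else flaky
```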
arXiv Detail & Related papers (2024-02-07T20:01:41Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Most testing procedures still rely on test cases written by humans to form test suites.
Existing approaches to automatic test-case generation produce tests that are either simple (e.g., unit tests) or require precise specifications.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
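The core of such a pipeline is a prompt that turns a report into a failing test; a hedged sketch, where `complete` is a hypothetical LLM client call and the prompt shape is illustrative:

```python
# Sketch of prompting an LLM with a bug report to draft a test case.
# `complete` is a hypothetical text-completion function standing in
# for whatever LLM client is used; the prompt shape is illustrative.
PROMPT = """You are given a bug report. Write a single runnable unit
test that fails while the bug is present and passes once it is fixed.

Bug report:
{report}

Test:"""

def generate_test(complete, report: str) -> str:
    return complete(PROMPT.format(report=report))
```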
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
- LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on Language Models [0.6562256987706128]
Test suites tend to grow when software evolves, making it often infeasible to execute all test cases with the allocated testing budgets.
Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases.
We propose LTM (Language model-based Test suite Minimization), a novel, scalable, and black-box similarity-based TSM approach.
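The underlying idea can be illustrated with a toy stand-in that uses token-set Jaccard similarity; LTM itself relies on language-model-based similarity rather than raw tokens:

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minimize_suite(tests: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Black-box similarity-based minimization in toy form: represent
    each test by its token set (no coverage data needed) and greedily
    drop any test too similar to one already kept. LTM uses language
    model embeddings instead of raw tokens; the idea is the same."""
    kept, kept_tokens = [], []
    for name, source in tests.items():
        tokens = set(source.split())
        if all(jaccard(tokens, t) < threshold for t in kept_tokens):
            kept.append(name)
            kept_tokens.append(tokens)
    return kept
```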
arXiv Detail & Related papers (2023-04-03T22:16:52Z)
- Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks [109.59988683444986]
MeanQ is a simple ensemble method that estimates target values as ensemble means.
MeanQ demonstrates remarkable sample efficiency in experiments on the Atari Learning Environment benchmark.
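A simplified reading of the target computation, assuming an array of per-member Q-estimates (shapes and names are illustrative):

```python
import numpy as np

def meanq_target(rewards, dones, next_q, gamma=0.99):
    """Simplified sketch of a MeanQ-style target: average the
    ensemble's Q-estimates at the next state before bootstrapping,
    lowering target variance relative to a single network.
    next_q: array of shape (ensemble_size, batch, num_actions)."""
    ensemble_mean = next_q.mean(axis=0)              # (batch, actions)
    return rewards + gamma * (1.0 - dones) * ensemble_mean.max(axis=1)
```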
arXiv Detail & Related papers (2022-09-16T01:47:36Z)
- AdaTest: Reinforcement Learning and Adaptive Sampling for On-chip Hardware Trojan Detection [25.593824693347113]
AdaTest is a novel adaptive test pattern generation framework for efficient and reliable Hardware Trojan (HT) detection.
To achieve high trigger coverage, AdaTest leverages Reinforcement Learning (RL) to produce a diverse set of test inputs.
AdaTest achieves up to two orders of magnitude speedup in test generation and two orders of magnitude reduction in test set size compared to prior works.
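The adaptive-sampling flavor can be sketched as a toy bandit over mutation operators; AdaTest's real state, action, and reward definitions are hardware-specific and far richer:

```python
import random

# Toy adaptive test-pattern generator: a bandit picks among mutation
# operators based on past reward (e.g. coverage gain), loosely in the
# spirit of an RL-driven test loop; NOT AdaTest's actual algorithm.
def adaptive_patterns(operators, reward_fn, rounds=1000, eps=0.1):
    value = {op: 0.0 for op in operators}
    count = {op: 0 for op in operators}
    pattern = [0] * 32                      # toy 32-bit test pattern
    for _ in range(rounds):
        op = (random.choice(list(operators)) if random.random() < eps
              else max(value, key=value.get))
        pattern = operators[op](pattern)    # mutate the pattern
        r = reward_fn(pattern)              # e.g. new coverage gained
        count[op] += 1
        value[op] += (r - value[op]) / count[op]  # running mean
        yield pattern
```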
arXiv Detail & Related papers (2022-04-12T23:56:59Z)
- Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible.
We consider an episodic framework, where the rewards within each episode are not independent, identically distributed, or Markovian.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
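A much-simplified stand-in for such a test compares recent and reference episode means with a one-sided z-test; the paper's test is strictly more powerful because it exploits the within-episode covariance structure:

```python
import math
import numpy as np

def deterioration_pvalue(reference_episodes, recent_episodes):
    """Much-simplified mean-shift test: compare mean episode rewards
    in a recent window against a reference window with a one-sided
    two-sample z-test (small p-value => rewards likely deteriorated).
    Treats episode means as approximately independent, which the
    paper's more powerful test does not need to assume."""
    ref = np.array([np.mean(ep) for ep in reference_episodes])
    new = np.array([np.mean(ep) for ep in recent_episodes])
    se = math.sqrt(ref.var(ddof=1) / len(ref) + new.var(ddof=1) / len(new))
    z = (ref.mean() - new.mean()) / se  # positive when rewards dropped
    return 0.5 * math.erfc(z / math.sqrt(2))  # one-sided p-value
```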
arXiv Detail & Related papers (2020-10-22T12:45:55Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
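One structural constraint such decoding must respect is that predicted edges only connect predicted nodes; a toy consistency check (PRover enforces a richer constraint set during inference):

```python
def valid_proof_graph(node_preds: dict, edge_preds: dict) -> bool:
    """Toy sketch of one structural constraint on a decoded proof
    graph: an edge may be predicted only between nodes that are
    themselves predicted to be in the proof.
    node_preds: {node: bool}; edge_preds: {(src, dst): bool}."""
    return all(
        node_preds.get(u, False) and node_preds.get(v, False)
        for (u, v), present in edge_preds.items() if present
    )
```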
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
Dorfman showed 80 years ago that, when the infection prevalence of a disease is low, testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
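Dorfman's two-stage scheme makes a compact worked example: test pools of size k, then retest every member of a positive pool; the expected number of tests per person is 1/k + 1 - (1-p)^k, which dips well below 1 at low prevalence p:

```python
# Worked example of Dorfman's noiseless two-stage group testing:
# test pools of size k, then retest each member of a positive pool.
def tests_per_person(p: float, k: int) -> float:
    return 1.0 / k + 1.0 - (1.0 - p) ** k

p = 0.01  # 1% prevalence
best_k = min(range(2, 101), key=lambda k: tests_per_person(p, k))
print(best_k, round(tests_per_person(p, best_k), 3))  # -> 11 0.196
```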
arXiv Detail & Related papers (2020-04-26T23:41:33Z)