Compiler Testing With Relaxed Memory Models
- URL: http://arxiv.org/abs/2310.12337v3
- Date: Mon, 29 Jan 2024 20:38:43 GMT
- Title: Compiler Testing With Relaxed Memory Models
- Authors: Luke Geeson, Lee Smith
- Abstract summary: We present the Téléchat compiler testing tool for concurrent programs.
Téléchat compiles a concurrent C/C++ program and compares source and compiled program behaviours.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Finding bugs is key to the correctness of compilers in wide use today. If the
behaviour of a compiled program, as allowed by its architecture memory model,
is not a behaviour of the source program under its source model, then there is
a bug. This holds for all programs, but we focus on concurrency bugs that occur
only with two or more threads of execution. We focus on testing techniques that
detect such bugs in C/C++ compilers.
We seek a testing technique that automatically covers concurrency bugs up to
fixed bounds on program sizes and that scales to find bugs in compiled programs
with many lines of code. Otherwise, a testing technique can miss bugs.
Unfortunately, the state-of-the-art techniques are yet to satisfy all of these
properties.
We present the Téléchat compiler testing tool for concurrent programs.
Téléchat compiles a concurrent C/C++ program and compares source and
compiled program behaviours using source and architecture memory models. We
make three claims: Téléchat improves the state-of-the-art at finding bugs
in code generation for multi-threaded execution, it is the first public
description of a compiler testing tool for concurrency that is deployed in
industry, and it is the first tool that takes a significant step towards the
desired properties. We provide experimental evidence suggesting Téléchat
finds bugs missed by other state-of-the-art techniques, case studies indicating
that Téléchat satisfies the properties, and reports of our experience
deploying Téléchat in industry regression testing.
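As an illustration of the bug criterion described in the abstract, consider a standard message-passing litmus test written with C/C++ atomics. This is a textbook example chosen for this summary, not a test taken from the paper: under the source model, the release/acquire pair forbids the reader from observing the flag update without also observing the data, so a compiled binary that can exhibit that outcome under its architecture memory model would contain a compiler bug.

```cpp
// Message-passing (MP) litmus test with C++11 atomics. Textbook example for
// illustration only; it is not taken from the Téléchat paper or its test suite.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};
int r_flag = 0;
int r_data = 0;

void writer() {
    data.store(1, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);  // publishes the write to data
}

void reader() {
    r_flag = flag.load(std::memory_order_acquire);
    r_data = data.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(writer);
    std::thread t1(reader);
    t0.join();
    t1.join();
    // The C/C++ source model forbids the outcome r_flag == 1 && r_data == 0:
    // an acquire load that reads the release store synchronises the threads,
    // making the data write visible. If the compiled program can exhibit this
    // outcome under the architecture memory model (for example because a
    // required barrier was dropped), that is a compiler bug.
    assert(!(r_flag == 1 && r_data == 0));
    return 0;
}
```

The comparison the abstract describes can be read as a set-inclusion check: every behaviour of the compiled program under the architecture model must also be a behaviour of the source program under the source model. The sketch below states that criterion directly; the behaviour encoding and the function name are illustrative placeholders, not Téléchat's actual interface.

```cpp
// Hypothetical sketch of the bug criterion: report a bug when the compiled
// program, under the architecture model, admits a behaviour that the source
// program does not admit under the source model. Placeholder types and names.
#include <algorithm>
#include <set>
#include <string>

using Behaviour = std::string;            // e.g. "r_flag=1; r_data=0"
using BehaviourSet = std::set<Behaviour>;

// A compiler bug is reported iff the compiled behaviours are not a subset of
// the source behaviours.
bool hasCompilerBug(const BehaviourSet& source, const BehaviourSet& compiled) {
    return !std::includes(source.begin(), source.end(),
                          compiled.begin(), compiled.end());
}
```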
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools have limited coverage of bug types.
We propose Citadel, a method that finds bugs more efficiently and effectively.
arXiv Detail & Related papers (2024-06-18T01:51:16Z)
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z)
- Weak Memory Demands Model-based Compiler Testing [0.0]
A compiler bug arises if the behaviour of a compiled concurrent program, as allowed by its architecture memory model, is not a behaviour permitted by the source program under its source model.
We observe that processor implementations are increasingly exploiting the behaviour of relaxed architecture models.
arXiv Detail & Related papers (2024-01-12T15:50:32Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
- HyperPUT: Generating Synthetic Faulty Programs to Challenge Bug-Finding Tools [3.8520163964103835]
We propose a complementary approach that automatically generates programs with seeded bugs.
Our technique, called HyperPUT, builds C programs from a "seed" bug by incrementally applying program transformations.
arXiv Detail & Related papers (2022-09-14T13:09:41Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith [2.1016374925364616]
This paper uses the code snippets in the bug reports to guide the test generation.
We evaluate this approach on eight versions of GCC.
We find that our approach provides higher coverage and triggers more miscompilation failures than the state-of-the-art test generation techniques for GCC.
arXiv Detail & Related papers (2020-12-19T11:25:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.