A Case Study on Model Checking and Runtime Verification for Awkernel
- URL: http://arxiv.org/abs/2503.09282v1
- Date: Wed, 12 Mar 2025 11:27:45 GMT
- Title: A Case Study on Model Checking and Runtime Verification for Awkernel
- Authors: Akira Hasegawa, Ryuta Kambe, Toshiaki Aoki, Yuuki Takano
- Abstract summary: It is difficult for humans to manually review concurrent behaviors or to write test cases covering all possible executions. This paper proposes a development method for concurrent software, such as schedulers.
- Score: 0.7199733380797578
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In operating system development, concurrency poses significant challenges. It is difficult for humans to manually review concurrent behaviors or to write test cases covering all possible executions, often resulting in critical bugs. Preemption in schedulers serves as a typical example. This paper proposes a development method for concurrent software, such as schedulers. Our method incorporates model checking as an aid for tracing code, simplifying the analysis of concurrent behavior; we refer to this as model checking-assisted code review. While this approach aids in tracing behaviors, the accuracy of the results is limited because of the semantics gap between the modeling language and the programming language. Therefore, we also introduce runtime verification to address this limitation in model checking-assisted code review. We applied our approach to a real-world operating system, Awkernel, as a case study. This new operating system, currently under development for autonomous driving, is designed for preemptive task execution using asynchronous functions in Rust. After implementing our method, we identified several bugs that are difficult to detect through manual reviews or simple tests.
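The abstract does not include code, but the runtime-verification side of the method can be illustrated with a small sketch. The following Rust example is a minimal, self-contained illustration, not Awkernel's actual scheduler or the paper's monitors: the `ToyScheduler` type, its `TaskId` handle, and the invariant ("a task is never simultaneously queued as ready and marked as running") are hypothetical stand-ins. The check is expressed as a plain assertion executed after every state transition, which is the essence of a runtime monitor for a preemptive scheduler.

```rust
// Hypothetical toy scheduler used only to illustrate runtime verification:
// an invariant is re-checked at runtime after every state transition.
// This is NOT Awkernel's scheduler or the paper's actual monitor.

use std::collections::{HashSet, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TaskId(u32);

struct ToyScheduler {
    ready: VecDeque<TaskId>,  // tasks waiting to be dispatched
    running: HashSet<TaskId>, // tasks currently executing
}

impl ToyScheduler {
    fn new() -> Self {
        Self { ready: VecDeque::new(), running: HashSet::new() }
    }

    /// Runtime-verification check: a task must never be queued as ready
    /// while it is also marked as running. Called after every transition.
    fn check_invariant(&self) {
        for t in &self.ready {
            assert!(
                !self.running.contains(t),
                "invariant violated: task {:?} is both ready and running",
                t
            );
        }
    }

    /// Wake a task: enqueue it unless it is already ready or running.
    fn wake(&mut self, task: TaskId) {
        if !self.running.contains(&task) && !self.ready.contains(&task) {
            self.ready.push_back(task);
        }
        self.check_invariant();
    }

    /// Dispatch the next ready task, marking it as running.
    fn dispatch(&mut self) -> Option<TaskId> {
        let next = self.ready.pop_front()?;
        self.running.insert(next);
        self.check_invariant();
        Some(next)
    }

    /// Preempt a running task and put it back on the ready queue.
    fn preempt(&mut self, task: TaskId) {
        if self.running.remove(&task) {
            self.ready.push_back(task);
        }
        self.check_invariant();
    }
}

fn main() {
    let mut sched = ToyScheduler::new();
    sched.wake(TaskId(1));
    sched.wake(TaskId(2));
    let running = sched.dispatch().expect("a task should be ready");
    sched.preempt(running); // the invariant check runs after each transition
    println!("ready queue length: {}", sched.ready.len());
}
```

In practice, such checks could be confined to test or debug builds (for example via `debug_assert!`) so they add no overhead to release builds; the invariants actually monitored in Awkernel are more involved than this sketch.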
Related papers
- Revisit Self-Debugging with Self-Generated Tests for Code Generation [18.643472696246686]
Self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential.
We propose two paradigms for the process: post-execution and in-execution self-debugging.
We find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests.
arXiv Detail & Related papers (2025-01-22T10:54:19Z)
- Go-Oracle: Automated Test Oracle for Go Concurrency Bugs [6.773048267569272]
Concurrency bugs have become a prevalent issue within the Go programming language.
Our work seeks to address the test oracle problem for Go programs, to automatically classify test executions as pass or fail.
We capture a comprehensive array of execution events using the native Go execution tracer.
We preprocess and encode these traces before training a transformer-based neural network to classify the traces as either passing or failing.
arXiv Detail & Related papers (2024-12-11T03:07:56Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- CodeFlow: Program Behavior Prediction with Dynamic Dependencies Learning [11.347234752942684]
We present CodeFlow, a novel machine learning-based approach that predicts code coverage and detects runtime errors.
CodeFlow effectively represents all possible execution paths and the statistical relations between different statements.
Our empirical evaluation demonstrates that CodeFlow significantly improves code coverage prediction accuracy and effectively localizes runtime errors.
arXiv Detail & Related papers (2024-08-05T20:32:00Z)
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
- Improving the Learning of Code Review Successive Tasks with Cross-Task Knowledge Distillation [1.0878040851638]
We introduce a novel deep-learning architecture, named DISCOREV, which employs cross-task knowledge distillation to address these tasks simultaneously.
We show that our approach generates better review comments, as measured by the BLEU score, as well as more accurate code refinement according to the CodeBLEU score.
arXiv Detail & Related papers (2024-02-03T07:02:22Z)
- Understanding prompt engineering may not require rethinking generalization [56.38207873589642]
We show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature.
This work provides a possible justification for the widespread practice of prompt engineering.
arXiv Detail & Related papers (2023-10-06T00:52:48Z)
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? [75.79305790453654]
Coaxing desired behaviors out of pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Execution Time Program Verification With Tight Bounds [6.36836017515443]
This paper presents a proof system for reasoning about execution time bounds for a core imperative programming language.
We define a Hoare logic for the three cases and prove its soundness with respect to an annotated cost-aware operational semantics.
The practicality of the proof system is demonstrated with an implementation in OCaml of the different modules needed to apply it to example programs.
arXiv Detail & Related papers (2022-10-20T09:02:13Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.