Go-Oracle: Automated Test Oracle for Go Concurrency Bugs
- URL: http://arxiv.org/abs/2412.08061v1
- Date: Wed, 11 Dec 2024 03:07:56 GMT
- Title: Go-Oracle: Automated Test Oracle for Go Concurrency Bugs
- Authors: Foivos Tsimpourlas, Chao Peng, Carlos Rosuero, Ping Yang, Ajitha Rajan
- Abstract summary: Concurrency bugs have become a prevalent issue within the Go programming language.
Our work seeks to address the test oracle problem for Go programs, to automatically classify test executions as pass or fail.
We capture a comprehensive array of execution events using the native Go execution tracer.
We preprocess and encode these traces before training a transformer-based neural network to effectively classify the traces as either passing or failing.
- Score: 6.773048267569272
- Abstract: The Go programming language has gained significant traction for developing software, especially in various infrastructure systems. Nonetheless, concurrency bugs have become a prevalent issue within Go, presenting a unique challenge due to the language's dual concurrency mechanisms: communicating sequential processes and shared memory. Detecting concurrency bugs and accurately classifying program executions as pass or fail presents an immense challenge, even for domain experts. We conducted a survey with expert developers at Bytedance that confirmed this challenge. Our work seeks to address the test oracle problem for Go programs, to automatically classify test executions as pass or fail. This problem has not been investigated in the literature for Go programs owing to its distinctive programming model. Our approach involves collecting both passing and failing execution traces from various subject Go programs. We capture a comprehensive array of execution events using the native Go execution tracer. Subsequently, we preprocess and encode these traces before training a transformer-based neural network to effectively classify the traces as either passing or failing. The evaluation of our approach encompasses 8 subject programs sourced from the GoBench repository. These subject programs are routinely used as benchmarks in an industry setting. Encouragingly, our test oracle, Go-Oracle, demonstrates high accuracies even when operating with a limited dataset, showcasing the efficacy and potential of our methodology. Developers at Bytedance strongly agreed that they would use the Go-Oracle tool over the current practice of manual inspections to classify tests for Go programs as pass or fail.
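To make the pipeline concrete, below is a minimal sketch of how a passing or failing execution could be recorded with the native Go execution tracer (runtime/trace), the event source the abstract describes; the runWorkload function is a hypothetical stand-in for a subject program's test, not code from the paper.

```go
// Sketch: record an execution trace around a concurrent workload using
// Go's native execution tracer (runtime/trace).
package main

import (
	"log"
	"os"
	"runtime/trace"
	"sync"
)

// runWorkload is a hypothetical stand-in for the concurrent code under test.
func runWorkload() {
	var wg sync.WaitGroup
	ch := make(chan int)
	wg.Add(1)
	go func() {
		defer wg.Done()
		ch <- 42 // channel send shows up as an event in the trace
	}()
	log.Printf("received %d", <-ch)
	wg.Wait()
}

func main() {
	f, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start the tracer; goroutine, channel, and scheduler events between
	// Start and Stop are written to exec.trace.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	runWorkload()
	trace.Stop()
}
```

The resulting exec.trace file can be inspected with go tool trace. For the preprocessing step, here is a sketch of turning such a trace into a token sequence for a classifier, assuming the experimental golang.org/x/exp/trace reader (Go 1.22+ trace format); mapping each event kind to one token is a hypothetical encoding chosen for illustration, not the paper's published scheme.

```go
// Sketch: decode trace events and encode them as a flat token sequence.
// Assumes golang.org/x/exp/trace; the one-token-per-event-kind scheme
// is hypothetical.
package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"golang.org/x/exp/trace"
)

func main() {
	f, err := os.Open("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r, err := trace.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}

	var tokens []string
	for {
		ev, err := r.ReadEvent()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// Encode each event by its kind; a real encoder would likely
		// also capture goroutine IDs, timestamps, and arguments.
		tokens = append(tokens, ev.Kind().String())
	}
	fmt.Printf("%d tokens, first few: %v\n", len(tokens), tokens[:min(10, len(tokens))])
}
```

A sequence like this is what a transformer-based classifier would consume to label the execution as passing or failing.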
Related papers
- Revisit Self-Debugging with Self-Generated Tests for Code Generation [18.643472696246686]
Self-debugging with self-generated tests is a promising solution, but its limitations and practical potential have not been fully explored.
We propose two paradigms for the process: post-execution and in-execution self-debugging.
We find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests.
arXiv Detail & Related papers (2025-01-22T10:54:19Z) - Effective Technical Reviews [0.7212939068975619]
While executing a program is the ultimate test of its correctness, reviewing the program can occur earlier in its development and find problems if done effectively.
This work focuses on review techniques that enable the programmer to effectively review a program and find a range of problems, including interface issues.
arXiv Detail & Related papers (2024-07-02T15:19:52Z) - Test Oracle Automation in the era of LLMs [52.69509240442899]
Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
arXiv Detail & Related papers (2024-05-21T13:19:10Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely *hidden transfer*, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Modelling Concurrency Bugs Using Machine Learning [0.0]
This project aims to compare both common and recent machine learning approaches.
We define a synthetic dataset that we generate with the aim of simulating real-life (concurrent) programs.
We formulate hypotheses about fundamental limits of various machine learning model types.
arXiv Detail & Related papers (2023-05-08T17:30:24Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - HyperPUT: Generating Synthetic Faulty Programs to Challenge Bug-Finding Tools [3.8520163964103835]
We propose a complementary approach that automatically generates programs with seeded bugs.
Our technique, called HyperPUT, builds C programs from a "seed" bug by incrementally applying program transformations.
arXiv Detail & Related papers (2022-09-14T13:09:41Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions [31.46148643917194]
We introduce a real-world dataset and task for predicting runtime errors.
We develop an interpreter-inspired architecture with an inductive bias towards mimicking program executions.
We show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error.
arXiv Detail & Related papers (2022-03-07T23:17:17Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)