Evolutionary Generative Fuzzing for Differential Testing of the Kotlin
Compiler
- URL: http://arxiv.org/abs/2401.06653v1
- Date: Fri, 12 Jan 2024 16:01:12 GMT
- Title: Evolutionary Generative Fuzzing for Differential Testing of the Kotlin
Compiler
- Authors: Calin Georgescu, Mitchell Olsthoorn, Pouria Derakhshanfar, Marat
Akhin, Annibale Panichella
- Abstract summary: We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
- Score: 14.259471945857431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compiler correctness is a cornerstone of reliable software development.
However, systematic testing of compilers is infeasible, given the vast space of
possible programs and the complexity of modern programming languages. In this
context, differential testing offers a practical methodology as it addresses
the oracle problem by comparing the output of alternative compilers given the
same set of programs as input. In this paper, we investigate the effectiveness
of differential testing in finding bugs within the Kotlin compilers developed
at JetBrains. We propose a black-box generative approach that creates input
programs for the K1 and K2 compilers. First, we build workable models of Kotlin
semantic (semantic interface) and syntactic (enriched context-free grammar)
language features, which are subsequently exploited to generate random code
snippets. Second, we extend random sampling by introducing two genetic
algorithms (GAs) that aim to generate more diverse input programs. Our case
study shows that the proposed approach effectively detects bugs in K1 and K2;
these bugs have been confirmed and (some) fixed by JetBrains developers. While
we do not observe a significant difference w.r.t. the number of defects
uncovered by the different search algorithms, random search and GAs are
complementary as they find different categories of bugs. Finally, we provide
insights into the relationships between the size, complexity, and fault
detection capability of the generated input programs.
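
To make the generation step concrete, the following is a minimal Kotlin sketch of weighted grammar-based sampling. The toy grammar, weights, and depth cutoff are illustrative stand-ins, not the paper's actual models: the real enriched context-free grammar and semantic interface cover far more of the Kotlin language.

```kotlin
import kotlin.random.Random

// Toy "enriched" CFG: each nonterminal maps to weighted productions.
// (Hypothetical simplification; the paper's grammar models real Kotlin.)
data class Production(val weight: Double, val symbols: List<String>)

val grammar: Map<String, List<Production>> = mapOf(
    "expr" to listOf(
        Production(0.4, listOf("<int>")),
        Production(0.3, listOf("(", "expr", " + ", "expr", ")")),
        Production(0.3, listOf("if (", "expr", " > ", "expr", ") ", "expr", " else ", "expr"))
    )
)

fun pickWeighted(ps: List<Production>, rng: Random): Production {
    var r = rng.nextDouble() * ps.sumOf { it.weight }
    for (p in ps) { r -= p.weight; if (r <= 0) return p }
    return ps.last()
}

fun sample(symbol: String, depth: Int, rng: Random): String {
    if (symbol == "<int>") return rng.nextInt(100).toString()
    val productions = grammar[symbol] ?: return symbol // plain terminals pass through
    // Force termination: at the depth limit, take the shortest production.
    val chosen = if (depth <= 0) productions.minBy { it.symbols.size }
                 else pickWeighted(productions, rng)
    return chosen.symbols.joinToString("") { sample(it, depth - 1, rng) }
}

fun main() {
    val rng = Random(42)
    repeat(3) { i -> println("fun f$i(): Int = " + sample("expr", 4, rng)) }
}
```

Each run emits small but syntactically valid Kotlin expressions; weighting productions is one way an enriched grammar can bias sampling toward interesting constructs.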
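The abstract does not spell out the two GAs, so the skeleton below is a hypothetical single-population variant: fitness rewards programs that are far from their peers (a crude character-level distance), and mutation simply re-samples from the grammar. It reuses `sample` from the sketch above.

```kotlin
import kotlin.random.Random

// Crude distance: number of differing positions, padded to the longer string.
fun distance(a: String, b: String): Double {
    val shared = a.zip(b).count { (x, y) -> x == y }
    return (maxOf(a.length, b.length) - shared).toDouble()
}

// Diversity fitness: mean distance from a candidate to the rest of the population.
fun diversity(candidate: String, population: List<String>): Double =
    population.filter { it != candidate }
        .map { distance(candidate, it) }
        .ifEmpty { listOf(0.0) }
        .average()

// One generation: keep the most diverse half, refill with fresh or copied samples.
fun evolve(population: List<String>, rng: Random, generate: () -> String): List<String> {
    val survivors = population
        .sortedByDescending { diversity(it, population) }
        .take(population.size / 2)
    val offspring = List(population.size - survivors.size) {
        if (rng.nextDouble() < 0.7) generate() else survivors.random(rng)
    }
    return survivors + offspring
}
```

A driver would seed the population with `sample(...)` calls, iterate `evolve` for a fixed budget, and send every generated program to the differential harness sketched below.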
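The differential oracle itself can be sketched as a pair of compiler invocations plus a comparison. The `kotlinc` command and the use of `-language-version` 1.9 vs. 2.0 to select the K1 vs. K2 front ends are assumptions tied to a particular toolchain release, not the paper's actual harness.

```kotlin
import java.io.File

data class CompileResult(val exitCode: Int, val diagnostics: String)

// Invoke kotlinc (assumed to be on PATH) with a given language version.
fun compile(source: File, languageVersion: String): CompileResult {
    val proc = ProcessBuilder(
        "kotlinc", "-language-version", languageVersion,
        "-d", "out-$languageVersion", source.path
    ).redirectErrorStream(true).start()
    val output = proc.inputStream.bufferedReader().readText()
    return CompileResult(proc.waitFor(), output)
}

fun main() {
    val snippet = File.createTempFile("fuzz", ".kt").apply {
        writeText("fun main() { println(1 + 2) }") // would come from the generator
    }
    val k1 = compile(snippet, "1.9") // assumed to select the K1 front end
    val k2 = compile(snippet, "2.0") // assumed to select the K2 front end
    // Differential oracle: disagreement on acceptance is a candidate bug.
    if (k1.exitCode != k2.exitCode) {
        println("Potential bug: K1 exit=${k1.exitCode}, K2 exit=${k2.exitCode}")
        println("K1 diagnostics:\n${k1.diagnostics}\nK2 diagnostics:\n${k2.diagnostics}")
    } else {
        println("Compilers agree (exit=${k1.exitCode})")
    }
}
```

Richer oracles (comparing diagnostic sets, crash signatures, or generated bytecode) follow the same pattern.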
Related papers
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
- Finding Missed Code Size Optimizations in Compilers using LLMs [1.90019787465083]
We develop a novel testing approach which combines large language models with a series of differential testing strategies.
Our approach requires fewer than 150 lines of code to implement.
To date we have reported 24 confirmed bugs in production compilers.
arXiv Detail & Related papers (2024-12-31T21:47:46Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- Galápagos: Automated N-Version Programming with LLMs [10.573037638807024]
We propose the automated generation of program variants using large language models.
We design, develop and evaluate Galápagos: a tool for generating program variants.
We evaluate Galápagos by creating N-Version components of real-world C code.
arXiv Detail & Related papers (2024-08-18T16:44:01Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Directed Test Program Generation for JIT Compiler Bug Localization [3.626013617212667]
Bug localization techniques for Just-in-Time (JIT) compilers are based on analyzing the execution behaviors of the target JIT compiler on a set of test programs generated for this purpose.
This paper proposes a novel technique for automatic test program generation for JIT compiler bug localization.
arXiv Detail & Related papers (2023-07-17T22:43:02Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith [2.1016374925364616]
This paper uses code snippets from bug reports to guide test generation.
We evaluate this approach on eight versions of GCC.
We find that our approach provides higher coverage and triggers more miscompilation failures than the state-of-the-art test generation techniques for GCC.
arXiv Detail & Related papers (2020-12-19T11:25:13Z)
- Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler [4.179323589439977]
We apply anomaly detection to source code and bytecode to facilitate the development of a programming language.
We define an anomaly as a code fragment that differs from typical code written in a particular programming language.
arXiv Detail & Related papers (2020-04-03T15:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.