Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports
- URL: http://arxiv.org/abs/2507.19275v1
- Date: Fri, 25 Jul 2025 13:54:42 GMT
- Title: Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports
- Authors: Bo Wang, Pengyang Wang, Chong Chen, Qi Sun, Jieke Shi, Chengran Yang, Ming Deng, Youfang Lin, Zhou Yang, David Lo
- Abstract summary: Mutation-based fuzzing is effective for uncovering compiler bugs, but designing high-quality mutators for modern languages remains challenging. We present Mut4All, a fully automated, language-agnostic framework that synthesizes mutators using Large Language Models (LLMs) and compiler-specific knowledge from bug reports.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mutation-based fuzzing is effective for uncovering compiler bugs, but designing high-quality mutators for modern languages with complex constructs (e.g., templates, macros) remains challenging. Existing methods rely heavily on manual design or human-in-the-loop correction, limiting scalability and cross-language generalizability. We present Mut4All, a fully automated, language-agnostic framework that synthesizes mutators using Large Language Models (LLMs) and compiler-specific knowledge from bug reports. It consists of three agents: (1) a mutator invention agent that identifies mutation targets and generates mutator metadata using compiler-related insights; (2) a mutator implementation synthesis agent, fine-tuned to produce initial implementations; and (3) a mutator refinement agent that verifies and corrects the mutators via unit-test feedback. Mut4All processes 1000 bug reports (500 Rust, 500 C++), yielding 319 Rust and 403 C++ mutators at ~$0.08 each via GPT-4o. Our customized fuzzer, using these mutators, finds 62 bugs in Rust compilers (38 new, 7 fixed) and 34 bugs in C++ compilers (16 new, 1 fixed). Mut4All outperforms existing methods in both unique crash detection and coverage, ranking first on Rust and second on C++.
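The three-agent pipeline described in the abstract can be pictured with a minimal sketch. All names below are invented for illustration, and the LLM calls are replaced with toy stand-ins (the real system prompts GPT-4o and a fine-tuned model); this shows only the control flow of invention, synthesis, and test-driven refinement, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MutatorSpec:
    """Metadata produced by the invention agent: what to mutate and why."""
    name: str
    target: str          # syntactic construct to mutate, e.g. "integer literal"
    description: str

def invention_agent(bug_report: str) -> MutatorSpec:
    """Stand-in for the LLM agent that mines a bug report for a mutation idea."""
    return MutatorSpec(
        name="flip_int_literal",
        target="integer literal",
        description=f"inspired by: {bug_report[:40]}",
    )

def synthesis_agent(spec: MutatorSpec) -> Callable[[str], str]:
    """Stand-in for the fine-tuned agent that emits an initial implementation.
    Here: a toy mutator that rewrites the literal 0 to 1 in the source text."""
    def mutate(src: str) -> str:
        return src.replace("0", "1")
    return mutate

def refinement_agent(mutate: Callable[[str], str],
                     unit_test: Callable[[Callable[[str], str]], bool],
                     max_rounds: int = 3) -> Callable[[str], str]:
    """Verify the mutator against a unit test; in the real system, failure
    feedback would be sent back to the LLM for another correction round."""
    for _ in range(max_rounds):
        if unit_test(mutate):
            return mutate
        # (correction step elided: re-prompt the model with the test failure)
    raise RuntimeError("mutator could not be repaired")

# Usage: synthesize one mutator and apply it to a seed program.
spec = invention_agent("ICE when folding constant 0 in match arm")
mutator = refinement_agent(
    synthesis_agent(spec),
    unit_test=lambda m: m("let x = 0;") != "let x = 0;",  # mutant must differ
)
print(mutator("let x = 0;"))  # mutated seed handed to the fuzzer
```

The key design point the sketch preserves is that verification is closed-loop: a mutator is only accepted once it passes its unit test, which is what lets the real pipeline run fully unattended.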
Related papers
- Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing [4.072167151876496]
CrossLangFuzzer generates cross-language test programs with diverse type parameters and complex inheritance structures. It successfully uncovered 10 confirmed bugs in the Kotlin compiler, 4 confirmed bugs in the Groovy compiler, 7 confirmed bugs in the Scala 3 compiler, 2 confirmed bugs in the Scala 2 compiler, and 1 confirmed bug in the Java compiler.
arXiv Detail & Related papers (2025-07-09T06:33:06Z) - BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis [1.9291502706655312]
We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline to generate, insert, and validate functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion.
arXiv Detail & Related papers (2025-06-12T09:02:20Z) - CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation [63.23120252801889]
CRUST-Bench is a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust as well as test cases. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting.
arXiv Detail & Related papers (2025-04-21T17:33:33Z) - EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs). It determines whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - Fuzzing MLIR Compilers with Custom Mutation Synthesis [6.617861009996863]
We develop a new test generator called SYNTHFUZZ that combines grammar-based fuzzing with custom synthesis mutation.
It obviates the need to manually define custom mutation operators for each dialect.
Our evaluation shows that SYNTHFUZZ on average improves MLIR dialect pair coverage by 1.75 times, which increases branch coverage by 1.22 times.
arXiv Detail & Related papers (2024-04-25T18:00:37Z) - MMT: Mutation Testing of Java Bytecode with Model Transformation -- An Illustrative Demonstration [0.11470070927586014]
Mutation testing is an approach to check the robustness of test suites.
We propose a model-driven approach where mutations of Java bytecode can be flexibly defined by model transformation.
The corresponding tool called MMT has been extended with advanced mutation operators for modifying object-oriented structures.
arXiv Detail & Related papers (2024-04-22T11:33:21Z) - LLMorpheus: Mutation Testing using Large Language Models [5.448283690603358]
This paper presents a technique for mutation testing where placeholders are introduced at designated locations in a program's source code. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS.
arXiv Detail & Related papers (2024-04-15T17:25:14Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen)
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z) - Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.