White-box Compiler Fuzzing Empowered by Large Language Models
- URL: http://arxiv.org/abs/2310.15991v1
- Date: Tue, 24 Oct 2023 16:39:06 GMT
- Title: White-box Compiler Fuzzing Empowered by Large Language Models
- Authors: Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh
Jabbarvand, Lingming Zhang
- Abstract summary: We propose WhiteFox, the first white-box compiler fuzzer using Large Language Models with source-code information.
WhiteFox can generate high-quality tests to exercise deep optimizations requiring intricate conditions.
To date, WhiteFox has found a total of 96 bugs, with 80 confirmed as previously unknown and 51 already fixed.
- Score: 11.826920511314336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compiler correctness is crucial, as miscompilation that falsifies
program behaviors can lead to serious consequences. In the literature, fuzzing
has been extensively studied to uncover compiler defects. However, compiler
fuzzing remains challenging: existing approaches focus on black- and grey-box
fuzzing, which generates tests without sufficient understanding of internal
compiler behaviors. As such, they often fail to construct programs that
exercise the conditions of intricate optimizations. Meanwhile, traditional
white-box techniques are computationally inapplicable to the giant codebases
of compilers.
Recent advances demonstrate that Large Language Models (LLMs) excel in code
generation/understanding tasks and have achieved state-of-the-art performance
in black-box fuzzing. Nonetheless, prompting LLMs with compiler source-code
information remains a missing piece of research in compiler testing.
To this end, we propose WhiteFox, the first white-box compiler fuzzer that
uses LLMs with source-code information to test compiler optimizations. WhiteFox
adopts a dual-model framework: (i) an analysis LLM examines the low-level
optimization source code and produces requirements on the high-level test
programs that can trigger the optimization; (ii) a generation LLM produces test
programs based on the summarized requirements. Additionally,
optimization-triggering tests are used as feedback to further enhance the test
generation on the fly. Our evaluation on four popular compilers shows that
WhiteFox can generate high-quality tests to exercise deep optimizations
requiring intricate conditions, triggering up to 8X more optimizations than
state-of-the-art fuzzers. To date, WhiteFox has found a total of 96 bugs, with
80 confirmed as previously unknown and 51 already fixed. Beyond compiler
testing,
WhiteFox can also be adapted for white-box fuzzing of other complex, real-world
software systems in general.
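Below is a minimal sketch of the dual-model loop the abstract describes, assuming hypothetical `query_llm` and `triggers_optimization` helpers in place of WhiteFox's actual LLM clients and compiler instrumentation:

```python
# Hedged sketch of the dual-model framework: an analysis LLM summarizes
# optimization source code into requirements, a generation LLM produces
# tests, and triggering tests feed back into later prompts.
import random

def query_llm(prompt: str) -> str:
    """Stand-in for a call to the analysis/generation LLM (assumption)."""
    raise NotImplementedError("plug in an LLM client here")

def triggers_optimization(test_program: str, opt_name: str) -> bool:
    """Stand-in: compile the test and report whether the optimization fired."""
    raise NotImplementedError("plug in an instrumented compiler here")

def fuzz_optimization(opt_name: str, opt_source: str, budget: int = 100):
    # (i) Analysis LLM: derive test-program requirements from low-level source.
    requirements = query_llm(
        f"Summarize what a test program needs in order to trigger the "
        f"`{opt_name}` optimization, given its source code:\n{opt_source}"
    )
    triggering: list[str] = []  # feedback pool of optimization-triggering tests
    for _ in range(budget):
        # (ii) Generation LLM: write a test from the requirements, reusing a
        # few previously triggering tests as in-context feedback.
        shots = "\n\n".join(random.sample(triggering, min(2, len(triggering))))
        test = query_llm(
            f"Requirements:\n{requirements}\n\n"
            f"Examples of triggering tests:\n{shots}\n\n"
            f"Write a new test program:"
        )
        if triggers_optimization(test, opt_name):
            triggering.append(test)  # enhance generation on the fly
            yield test
```

The feedback pool mirrors the on-the-fly enhancement in the abstract: tests that actually trigger the optimization are sampled back into subsequent generation prompts as in-context examples.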
Related papers
- ReFuzzer: Feedback-Driven Approach to Enhance Validity of LLM-Generated Test Programs [0.0]
Existing compiler fuzzers often produce syntactically or semantically invalid test programs.
We introduce ReFuzzer, a framework for refining LLM-generated test programs.
We evaluate ReFuzzer's effectiveness across black-, grey- and white-box fuzzing approaches targeting LLVM/Clang.
arXiv Detail & Related papers (2025-08-05T16:17:02Z)
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.
The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning [49.16469288280772]
We present D-LiFT, an automated decompiler backend that harnesses and trains LLMs to improve the quality of decompiled code via reinforcement learning (RL).
D-LiFT adheres to a key principle for enhancing the quality of decompiled code: preserving accuracy while improving readability.
Central to D-LiFT, we propose D-SCORE, an integrated quality assessment system to score the decompiled code from multiple aspects.
arXiv Detail & Related papers (2025-06-11T19:09:08Z)
- RAG-Based Fuzzing of Cross-Architecture Compilers [0.8302146576157498]
OneAPI is an open standard that supports cross-architecture software development with minimal effort from developers.
OneAPI brings DPC++ and C++ compilers, which need to be thoroughly tested to verify their correctness, reliability, and security.
This paper proposes a large-language-model (LLM)-based compiler fuzzing tool that integrates the concept of retrieval-augmented generation (RAG).
arXiv Detail & Related papers (2025-04-11T20:46:52Z)
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
- Finding Missed Code Size Optimizations in Compilers using LLMs [1.90019787465083]
We develop a novel testing approach which combines large language models with a series of differential testing strategies.
Our approach requires fewer than 150 lines of code to implement.
To date we have reported 24 confirmed bugs in production compilers.
arXiv Detail & Related papers (2024-12-31T21:47:46Z)
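One differential-testing strategy in this spirit fits in a few lines; the choice of compilers, the `-Oz` size flag, and the size-ratio threshold below are illustrative assumptions rather than the paper's exact setup:

```python
# Hedged sketch: compile the same (e.g., LLM-generated) C program for size
# with two compilers and flag large discrepancies as candidate missed
# code size optimizations in the compiler producing the bigger output.
import os
import subprocess
import tempfile

def object_size(compiler: str, source_path: str) -> int:
    """Compile for size (-Oz; supported by clang, and by GCC >= 12)
    and return the object file's size in bytes."""
    fd, obj_path = tempfile.mkstemp(suffix=".o")
    os.close(fd)
    try:
        subprocess.run([compiler, "-Oz", "-c", source_path, "-o", obj_path],
                       check=True, capture_output=True)
        return os.path.getsize(obj_path)
    finally:
        os.unlink(obj_path)

def looks_like_missed_optimization(source_path: str, ratio: float = 1.5) -> bool:
    """Report a candidate when one compiler's output is much larger."""
    sizes = [object_size(cc, source_path) for cc in ("clang", "gcc")]
    return max(sizes) > ratio * min(sizes)
```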
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- LLM-Vectorizer: LLM-based Verified Loop Vectorizer [12.048697450464935]
Large-language models (LLMs) can generate vectorized code from scalar programs that process individual array elements.
LLMs are capable of producing high-performance vectorized code with run-time speedups ranging from 1.1x to 9.4x.
Our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
arXiv Detail & Related papers (2024-06-07T07:04:26Z)
- IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation [3.7297002723174235]
We implement IRFuzzer to investigate the effectiveness of specialized fuzzing of the LLVM compiler backend.
The mutator in IRFuzzer is capable of generating a wide range of LLVM IR inputs, including structured control flow, vector types, and function definitions.
We show that IRFuzzer is more effective than existing fuzzers by fuzzing 29 mature LLVM backend targets.
arXiv Detail & Related papers (2024-02-07T21:02:33Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- A Survey of Modern Compiler Fuzzing [0.0]
This survey provides a summary of the research efforts toward understanding and addressing compiler defects.
It covers researchers' investigations and expertise on compiler bugs, such as their symptoms and root causes.
In addition, it covers researchers' efforts in designing fuzzing techniques, including constructing test programs and designing test oracles.
arXiv Detail & Related papers (2023-06-12T06:03:51Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
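A simplified sketch of that verification step follows; `verifier_score` and `execute` stand in for the trained verifier and a sandboxed executor, and the real LEVER additionally aggregates scores over samples that produce the same execution result:

```python
# Hedged sketch of LEVER-style reranking: score each sampled program on
# the natural language input, the program itself, and its execution
# result, then return the sample the verifier deems most likely correct.
def verifier_score(instruction: str, program: str, result: str) -> float:
    """Stand-in for the trained correctness verifier (returns a probability)."""
    raise NotImplementedError("plug in the trained verifier here")

def execute(program: str) -> str:
    """Stand-in: run the candidate program and capture its result."""
    raise NotImplementedError("plug in a sandboxed executor here")

def rerank(instruction: str, samples: list[str]) -> str:
    return max(samples,
               key=lambda p: verifier_score(instruction, p, execute(p)))
```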
- Operator Fusion in XLA: Analysis and Evaluation [1.2183405753834562]
XLA is the most common machine learning (ML) compiler.
We show how different fusion decisions in XLA are made in practice.
We implement several XLA kernel fusion strategies that can achieve up to 10.56x speedup.
arXiv Detail & Related papers (2023-01-30T17:03:33Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation [20.519361342905775]
We propose Tzer, a practical fuzzing technique for the widely used TVM tensor compiler.
Our results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing.
To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed.
arXiv Detail & Related papers (2022-02-21T01:48:11Z)
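The joint IR-pass mutation idea above can be sketched generically; `mutate_ir` and `run_passes` below are hypothetical helpers standing in for Tzer's actual TVM integration and coverage instrumentation:

```python
# Hedged sketch of coverage-guided joint mutation: alternately mutate the
# IR module and the optimization-pass sequence, keeping any mutant that
# reaches branches not covered before.
import random

def mutate_ir(ir: str) -> str:
    """Stand-in for a structural IR mutation (assumption)."""
    raise NotImplementedError("IR mutation goes here")

def run_passes(ir: str, passes: list[str]) -> set[int]:
    """Stand-in: compile `ir` under `passes`; return covered branch ids."""
    raise NotImplementedError("instrumented compilation goes here")

def fuzz(seed_ir: str, all_passes: list[str], iterations: int = 1000):
    corpus = [(seed_ir, [])]
    covered_so_far: set[int] = set()
    for _ in range(iterations):
        ir, passes = random.choice(corpus)
        if random.random() < 0.5:               # mutate the IR ...
            ir = mutate_ir(ir)
        else:                                   # ... or the pass sequence
            passes = passes + [random.choice(all_passes)]
        covered = run_passes(ir, passes)
        if not covered <= covered_so_far:       # new coverage: keep mutant
            covered_so_far |= covered
            corpus.append((ir, passes))
    return corpus
```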
This list is automatically generated from the titles and abstracts of the papers on this site.