Related papers: Improving Compiler Bug Isolation by Leveraging Large Language Models

URL: http://arxiv.org/abs/2506.17647v1
Date: Sat, 21 Jun 2025 09:09:30 GMT
Title: Improving Compiler Bug Isolation by Leveraging Large Language Models
Authors: Yixian Qi, Jiajun Jiang, Fengjie Li, Bowen Chen, Hongyu Zhang, Junjie Chen
Abstract summary: We propose an innovative compiler bug isolation approach named AutoCBI. We evaluate AutoCBI against state-of-the-art approaches (DiWi, RecBi and FuseFL) on 120 real-world bugs from the widely-used GCC and LLVM compilers. Specifically, AutoCBI isolates 66.67%/69.23%, 300%/340%, and 100%/57.14% more bugs than RecBi, DiWi, and FuseFL, respectively, in the Top-1 ranked results for GCC/LLVM.
Score: 14.679589768900621
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compilers play a foundational role in building reliable software systems, and bugs within them can lead to catastrophic consequences. The compilation process typically involves hundreds of files, making traditional automated bug isolation techniques inapplicable due to scalability or effectiveness issues. Current mainstream compiler bug localization techniques have limitations in test program mutation and resource consumption. Inspired by the recent advances of pre-trained Large Language Models (LLMs), we propose an innovative approach named AutoCBI, which (1) uses LLMs to summarize compiler file functions and (2) employs specialized prompts to guide the LLM in reordering suspicious file rankings. This approach leverages four types of information: the failing test program, source file function summaries, lists of suspicious files identified through analyzing test coverage, and compilation configurations with related output messages, resulting in a refined ranking of suspicious files. Our evaluation of AutoCBI against state-of-the-art approaches (DiWi, RecBi and FuseFL) on 120 real-world bugs from the widely-used GCC and LLVM compilers demonstrates its effectiveness. Specifically, AutoCBI isolates 66.67%/69.23%, 300%/340%, and 100%/57.14% more bugs than RecBi, DiWi, and FuseFL, respectively, in the Top-1 ranked results for GCC/LLVM. Additionally, the ablation study underscores the significance of each component in our approach.
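Illustration: the abstract describes combining four inputs into a prompt and asking an LLM to reorder a coverage-based list of suspicious files. The following is a minimal sketch of that idea only, not the AutoCBI implementation. The class `BugReport`, the functions `build_prompt` and `rerank`, the prompt wording, the file names, and the `ask_llm` callable are all hypothetical; no real LLM API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BugReport:
    """The four kinds of input named in the abstract (field names are hypothetical)."""
    test_program: str               # failing test program source
    compile_command: str            # compilation configuration
    compiler_output: str            # related output messages
    suspicious_files: List[str]     # coverage-based ranking, most suspicious first
    file_summaries: Dict[str, str]  # LLM-written summary per compiler source file


def build_prompt(report: BugReport) -> str:
    """Assemble one text prompt from the four inputs (wording is an assumption)."""
    lines = [
        "A compiler bug is triggered by the test program below.",
        "Reorder the candidate files from most to least suspicious.",
        "Answer with one file path per line and nothing else.",
        "",
        "## Failing test program",
        report.test_program,
        "",
        "## Compilation configuration and output",
        report.compile_command,
        report.compiler_output,
        "",
        "## Candidate files (coverage-based order) with summaries",
    ]
    for path in report.suspicious_files:
        summary = report.file_summaries.get(path, "(no summary available)")
        lines.append(f"- {path}: {summary}")
    return "\n".join(lines)


def rerank(report: BugReport, ask_llm: Callable[[str], str]) -> List[str]:
    """Return a refined ranking; ask_llm is any function mapping prompt to reply."""
    reply = ask_llm(build_prompt(report))
    candidates = set(report.suspicious_files)
    ranked: List[str] = []
    for line in reply.splitlines():
        path = line.strip()
        # Ignore paths the model invented and paths it repeated.
        if path in candidates and path not in ranked:
            ranked.append(path)
    # Files the model omitted keep their original relative order at the end.
    ranked.extend(p for p in report.suspicious_files if p not in ranked)
    return ranked


if __name__ == "__main__":
    demo = BugReport(
        test_program="int main(void) { return 0; }",
        compile_command="cc -O2 test.c",
        compiler_output="internal compiler error",
        suspicious_files=["pass_a.c", "pass_b.c", "pass_c.c"],
        file_summaries={"pass_a.c": "hypothetical pass A", "pass_b.c": "hypothetical pass B"},
    )
    # A stand-in for a real model: it promotes pass_b.c and names an unknown file.
    print(rerank(demo, lambda prompt: "pass_b.c\nunknown.c\n"))
```

Validating the reply against the original candidate list is a design choice of this sketch: it guarantees that the output is always a permutation of the input ranking, even when the model's answer is incomplete or contains unknown paths.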

Related papers

Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning [49.16469288280772]
We present D-LiFT, an automated decompiler backend that harnesses and trains LLMs to improve the quality of decompiled code via reinforcement learning (RL). D-LiFT adheres to a key principle for enhancing the quality of decompiled code: preserving accuracy while improving readability. Central to D-LiFT, we propose D-SCORE, an integrated quality assessment system to score the decompiled code from multiple aspects.
arXiv Detail & Related papers (2025-06-11T19:09:08Z)
ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries [2.696054049278301]
We introduce DeBinVul, a novel decompiled binary code vulnerability dataset. We fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in detecting binary code vulnerabilities.
arXiv Detail & Related papers (2024-11-07T18:54:31Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions. We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [51.898805184427545]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
LLM4Decompile: Decompiling Binary Code with Large Language Models [10.346311290153398]
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results difficult to read and execute. We propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate.
arXiv Detail & Related papers (2024-03-08T13:10:59Z)
LLMDFA: Analyzing Dataflow in Code with Large Language Models [8.92611389987991]
This paper presents LLMDFA, a compilation-free and customizable dataflow analysis framework. We decompose the problem into several subtasks and introduce a series of novel strategies. On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35.
arXiv Detail & Related papers (2024-02-16T15:21:35Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Dcc --help: Generating Context-Aware Compiler Error Explanations with Large Language Models [53.04357141450459]
dcc --help was deployed to our CS1 and CS2 courses, with 2,565 students using the tool over 64,000 times in ten weeks. We found that the LLM-generated explanations were conceptually accurate in 90% of compile-time and 75% of run-time cases, but often disregarded the instruction not to provide solutions in code.
arXiv Detail & Related papers (2023-08-23T02:36:19Z)
Isolating Compiler Bugs by Generating Effective Witness Programs with Large Language Models [10.660543763757518]
Existing compiler bug isolation approaches convert the problem into a test program mutation problem. We propose a new approach named LLM4CBI to utilize LLMs to generate effective test programs for compiler bug isolation. Compared with state-of-the-art approaches over 120 real bugs from GCC and LLVM, our evaluation demonstrates the advantages of LLM4CBI.
arXiv Detail & Related papers (2023-07-02T15:20:54Z)
Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files. We build different classification models capable of inferring the high-level type returned by functions. Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.