Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
- URL: http://arxiv.org/abs/2505.19489v1
- Date: Mon, 26 May 2025 04:15:48 GMT
- Title: Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
- Authors: Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou
- Abstract summary: Fault localization (FL) aims at identifying the buggy code elements in software. Recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench. We introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve the FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., a 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.
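The headline numbers above (e.g., a best top-1 accuracy of 41.6% at file level) are top-k accuracies over ranked file predictions. As a rough illustration only, here is a minimal Python sketch of how such a file-level top-k metric could be computed; the data layout and function name are assumptions for illustration, not LinuxFLBench's actual evaluation code.

```python
# Illustrative sketch (assumed data layout, not the LinuxFLBench implementation):
# an agent returns a ranked list of candidate file paths per bug, and a bug
# counts as localized at rank k if any ground-truth buggy file appears among
# the first k predictions.

def topk_file_accuracy(predictions, ground_truth, k=1):
    """predictions: bug_id -> ranked list of file paths from an FL agent.
    ground_truth: bug_id -> set of actual buggy file paths."""
    hits = sum(
        1
        for bug_id, buggy_files in ground_truth.items()
        if any(path in buggy_files for path in predictions.get(bug_id, [])[:k])
    )
    return hits / len(ground_truth) if ground_truth else 0.0

# Toy example: only bug-2 is hit at top-1; bug-1 is hit at rank 2.
preds = {"bug-1": ["drivers/net/foo.c", "mm/slab.c"], "bug-2": ["fs/ext4/inode.c"]}
truth = {"bug-1": {"mm/slab.c"}, "bug-2": {"fs/ext4/inode.c"}}
print(topk_file_accuracy(preds, truth, k=1))  # 0.5
print(topk_file_accuracy(preds, truth, k=2))  # 1.0
```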
Related papers
- D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning [49.16469288280772]
We present D-LiFT, an automated decompiler backend that harnesses and trains LLMs to improve the quality of decompiled code via reinforcement learning (RL). D-LiFT adheres to a key principle for enhancing the quality of decompiled code: preserving accuracy while improving readability. Central to D-LiFT, we propose D-SCORE, an integrated quality assessment system to score the decompiled code from multiple aspects.
arXiv Detail & Related papers (2025-06-11T19:09:08Z)
- CrashFixer: A crash resolution agent for the Linux kernel [58.152358195983155]
This work builds upon kGym, which provides a benchmark for system-level Linux kernel bugs and a platform to run experiments on the Linux kernel. This paper introduces CrashFixer, the first LLM-based software repair agent that is applicable to Linux kernel bugs.
arXiv Detail & Related papers (2025-04-29T04:18:51Z)
- MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions [24.744652237986276]
Large language models (LLMs) have shown remarkable progress across various domains. MigGPT is a framework that employs a novel code fingerprint structure to retain code snippet information.
arXiv Detail & Related papers (2025-04-13T08:08:37Z)
- Liger Kernel: Efficient Triton Kernels for LLM Training [6.373771349397682]
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands. We introduce Liger Kernel, an open-source set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
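The two optimizations named above, kernel fusion and input chunking, both target the memory-heavy language-model head. As a loose illustration of the chunking idea only, the PyTorch sketch below computes cross-entropy over chunks of hidden states so the full [tokens, vocab] logits tensor is never materialized at once; all names here are assumptions, and the real Liger Kernel implements this as fused Triton kernels with a custom backward pass rather than plain PyTorch.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """hidden: [num_tokens, d_model]; lm_head_weight: [vocab, d_model];
    labels: [num_tokens]. Projects and reduces one chunk at a time, so only
    a [chunk_size, vocab] logits slice exists per step in a pure forward
    pass (with autograd enabled, chunk activations are still saved for
    backward, which is what fused Triton kernels additionally avoid)."""
    total = hidden.new_zeros(())
    num_tokens = hidden.shape[0]
    for start in range(0, num_tokens, chunk_size):
        h = hidden[start:start + chunk_size]          # [chunk, d_model]
        logits = h @ lm_head_weight.T                 # [chunk, vocab]
        total = total + F.cross_entropy(
            logits, labels[start:start + chunk_size], reduction="sum")
    return total / num_tokens
```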
arXiv Detail & Related papers (2024-10-14T18:17:01Z)
- Impact of Large Language Models of Code on Fault Localization [2.936007114555107]
We propose a simple but effective sequence generation approach for fine-tuning large language models of code for FL tasks.
Specifically, we fine-tune 13 representative encoder-based, encoder-decoder-based, and decoder-based LLMCs for FL tasks.
Experimental results show that LLMCs fine-tuned with our approach successfully pinpoint error positions in 50.6%, 64.2%, and 72.3% of 1,291 methods in Defects4J for Top-2/3/5 predictions, respectively.
arXiv Detail & Related papers (2024-08-19T02:36:07Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate whether ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
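The FLOP-utilization comparison mentioned above is commonly expressed as model FLOP utilization (MFU): achieved training FLOP/s divided by the hardware's peak. Below is a minimal sketch under the standard ~6 FLOPs per parameter per token estimate for a dense transformer's forward plus backward pass; the numbers and function name are illustrative assumptions, not this paper's measurements.

```python
def model_flop_utilization(n_params, tokens_per_sec, peak_flops):
    """MFU = achieved FLOP/s / peak FLOP/s, using the common ~6*N*D estimate
    of training FLOPs for a dense transformer (forward + backward)."""
    return (6.0 * n_params * tokens_per_sec) / peak_flops

# Hypothetical example: a 7B-parameter model training at 500 tokens/s on
# hardware rated at 50 TFLOP/s -> 6 * 7e9 * 500 / 50e12 = 0.42 (42% MFU).
print(model_flop_utilization(7e9, 500, 50e12))
```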
arXiv Detail & Related papers (2023-10-04T20:27:20Z)