KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
- URL: http://arxiv.org/abs/2407.02680v5
- Date: Tue, 12 Nov 2024 01:39:07 GMT
- Title: KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
- Authors: Alex Mathai, Chenxi Huang, Petros Maniatis, Aleksandr Nogikh, Franjo Ivancic, Junfeng Yang, Baishakhi Ray,
- Abstract summary: Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
- Score: 59.20933707301566
- License:
- Abstract: Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide), and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching the code base. We use kGym to facilitate evaluation on kBench, a crash resolution benchmark drawn from real-world Linux kernel bugs. An example bug in kBench contains crashing stack traces, a bug-reproducer file, a developer-written fix, and other associated data. To understand current performance, we conduct baseline experiments by prompting LLMs to resolve Linux kernel crashes. Our initial evaluations reveal that the best performing LLM achieves 0.72% and 5.38% in the unassisted and assisted (i.e., buggy files disclosed to the model) settings, respectively. These results highlight the need for further research to enhance model performance in SE tasks. Improving performance on kBench requires models to master new learning skills, including understanding the cause of crashes and repairing faults, writing memory-safe and hardware-aware code, and understanding concurrency. As a result, this work opens up multiple avenues of research at the intersection of machine learning and systems software.
Related papers
- KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels.
We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct.
Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
arXiv Detail & Related papers (2025-02-14T19:30:53Z) - Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency [0.0]
We evaluate various programming languages and frameworks, including PyTorch, Python, Mojo, C++, and Java.
We investigate the Mojo SDK, a novel framework designed for large language model (LLM) inference on Apple Silicon.
Our experiments, conducted on an Apple M1 Max, demonstrate Mojo SDK's competitive performance, ease of use, and seamless Python compatibility.
arXiv Detail & Related papers (2025-01-30T19:36:33Z) - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z) - Investigating Memory Failure Prediction Across CPU Architectures [8.477622236186695]
We investigate the correlation between Correctable Errors (CEs) and Uncorrectable Errors (UEs) across different CPU architectures.
Our analysis identifies unique patterns of memory failure associated with each processor platform.
We conduct the memory failure prediction in different processors' platforms, achieving up to 15% improvements in F1-score compared to the existing algorithm.
arXiv Detail & Related papers (2024-06-08T05:10:23Z) - GWP-ASan: Sampling-Based Detection of Memory-Safety Bugs in Production [30.534320345970286]
heap-use-after-free and heap-buffer-overflow bugs remain the primary problem for security, reliability, and developer productivity for applications written in C or C++.
This paper describes a family of tools that detect these two classes of memory-safety bugs, while running in production, at near-zero overhead.
arXiv Detail & Related papers (2023-11-15T21:41:53Z) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [80.52201658231895]
SWE-bench is an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories.
We show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
arXiv Detail & Related papers (2023-10-10T16:47:29Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - ML-driven Hardware Cost Model for MLIR [1.2987894327817158]
We develop a machine learning-based cost model for high-level MLIR.
By considering the incoming MLIR as a text input a la NLP models we can apply well-known techniques from modern NLP research.
We show that these models can provide reasonably good estimates with low error bounds for various hardware characteristics of interest.
arXiv Detail & Related papers (2023-02-14T11:32:47Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z) - MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical
Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.