Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction
- URL: http://arxiv.org/abs/2602.07287v1
- Date: Sat, 07 Feb 2026 00:34:08 GMT
- Title: Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction
- Authors: Juefei Pu, Xingyu Li, Haonan Li, Zhengchuan Liang, Jonathan Cox, Yifan Wu, Kareem Shehada, Arrdya Srivastav, Zhiyun Qian
- Abstract summary: We present the first large-scale study of LLM-based Linux kernel vulnerability reproduction. Using kernel security patches as input, K-Repro automates end-to-end bug reproduction of N-day vulnerabilities in the Linux kernel. Our results show that K-Repro can generate PoCs that reproduce over 50% of the cases with practical time and monetary cost.
- Score: 27.460244103362935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous large language model (LLM) based systems have recently shown promising results across a range of cybersecurity tasks. However, there is no systematic study on their effectiveness in autonomously reproducing Linux kernel vulnerabilities with concrete proofs-of-concept (PoCs). Owing to the size, complexity, and low-level nature of the Linux kernel, such tasks are widely regarded as particularly challenging for current LLM-based approaches. In this paper, we present the first large-scale study of LLM-based Linux kernel vulnerability reproduction. For this purpose, we develop K-Repro, an LLM-based agentic system equipped with controlled code-browsing, virtual machine management, interaction, and debugging capabilities. Using kernel security patches as input, K-Repro automates end-to-end bug reproduction of N-day vulnerabilities in the Linux kernel. On a dataset of 100 real-world exploitable Linux kernel vulnerabilities collected from KernelCTF, our results show that K-Repro can generate PoCs that reproduce over 50% of the cases with practical time and monetary cost. Beyond aggregate success rates, we perform an extensive study of effectiveness, efficiency, stability, and impact factors to explain when agentic reproduction succeeds, where it fails, and which components drive performance. These findings provide actionable guidance for building more reliable autonomous security agents and for assessing real-world N-day risk from both offensive and defensive perspectives.
Related papers
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units [39.846358001824996]
We propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations. We also design NPU KernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels.
arXiv Detail & Related papers (2026-01-12T03:12:58Z)
- Rethinking Provenance Completeness with a Learning-Based Linux Scheduler [23.33056415010496]
Provenance plays a critical role in maintaining traceability of a system's actions for root cause analysis of security threats and impacts. Recent research has questioned whether existing provenance collection systems fail to ensure the security guarantees of a true reference monitor. We introduce Aegis, a scheduler for Linux specifically designed for provenance.
arXiv Detail & Related papers (2025-10-09T17:18:50Z)
- Automated Vulnerability Validation and Verification: A Large Language Model Approach [7.482522010482827]
This paper introduces an end-to-end multi-step pipeline leveraging generative AI, specifically large language models (LLMs). Our approach extracts information from CVE disclosures in the National Vulnerability Database. It augments this information with external public knowledge (e.g., threat advisories, code snippets) using Retrieval-Augmented Generation (RAG). The pipeline iteratively refines generated artifacts, validates attack success with test cases, and supports complex multi-container setups.
arXiv Detail & Related papers (2025-09-28T19:16:12Z)
- Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models [26.985412258634256]
Large language models (LLMs) have shown promise for automated kernel optimization, demonstrating success in domains with comprehensive technical documents and mature codebases. We present Evolution of Kernels (EoK), a novel LLM-based evolutionary program search framework that automates kernel design for domains with limited reference material. EoK achieves a median 1.27x speedup, surpassing human experts on all 80 evaluated kernel design tasks.
arXiv Detail & Related papers (2025-09-14T08:11:06Z)
- Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs [9.986455089493779]
Fault localization (FL) aims at identifying the buggy code elements in software. Recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench. We introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs.
arXiv Detail & Related papers (2025-05-26T04:15:48Z)
- Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
- Dos and Don'ts of Machine Learning in Computer Security [74.1816306998445]
Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance.
We identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.
We propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.
arXiv Detail & Related papers (2020-10-19T13:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.