Empirical Study of Code Large Language Models for Binary Security Patch Detection
- URL: http://arxiv.org/abs/2509.06052v1
- Date: Sun, 07 Sep 2025 13:31:43 GMT
- Title: Empirical Study of Code Large Language Models for Binary Security Patch Detection
- Authors: Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, Zongjie Li,
- Abstract summary: Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks.<n>In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code.<n>However, these approaches cannot be applied to closed-source applications and proprietary systems that constitute a significant portion of real-world software.
- Score: 12.110226735365643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks. In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code. However, these approaches typically cannot be applied to closed-source applications and proprietary systems that constitute a significant portion of real-world software, as they release patches only with binary files, and the source code is inaccessible. Given the impressive performance of code large language models (LLMs) in code intelligence and binary analysis tasks such as decompilation and compilation optimization, their potential for detecting binary security patches remains unexplored, exposing a significant research gap between their demonstrated low-level code understanding capabilities and this critical security task. To address this gap, we construct a large-scale binary patch dataset containing \textbf{19,448} samples, with two levels of representation: assembly code and pseudo-code, and systematically evaluate \textbf{19} code LLMs of varying scales to investigate their capability in binary SPD tasks. Our initial exploration demonstrates that directly prompting vanilla code LLMs struggles to accurately identify security patches from binary patches, and even state-of-the-art prompting techniques fail to mitigate the lack of domain knowledge in binary SPD within vanilla models. Drawing on the initial findings, we further investigate the fine-tuning strategy for injecting binary SPD domain knowledge into code LLMs through two levels of representation. Experimental results demonstrate that fine-tuned LLMs achieve outstanding performance, with the best results obtained on the pseudo-code representation.
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - StriderSPD: Structure-Guided Joint Representation Learning for Binary Security Patch Detection [22.120085662911194]
Security Patch Detection (SPD) comes to protect software assets.<n>Most SPD studies have targeted Open-Source Software (OSS), yet a large portion of real-world software is closed-source.<n>We propose textbftextitStriderSPD, a framework of binary code that integrates a graph branch into a large language model.
arXiv Detail & Related papers (2026-01-09T12:55:29Z) - Binary Diff Summarization using Large Language Models [17.877160310535942]
Large language models (LLMs) have been applied to binary analysis to augment traditional tools.<n>We propose a novel framework for binary diff summarization using LLMs.<n>We create a software supply chain security benchmark by injecting 3 different malware into 6 open-source projects.
arXiv Detail & Related papers (2025-09-28T16:47:24Z) - Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection.<n>We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z) - Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench [9.229310642804036]
We present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset.<n>We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches.<n>We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data.
arXiv Detail & Related papers (2025-06-30T21:10:19Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries [4.1417640577742425]
Vul-BinLLM is a framework for binary vulnerability detection using Large Language Models.<n>Vul-BinLLM mirrors traditional binary analysis with fine-grained optimizations in decompilation and vulnerability reasoning with an extended context.<n>Our evaluations show that Vul-BinLLM is highly effective in detecting vulnerabilities on the compiled Juliet dataset.
arXiv Detail & Related papers (2025-05-28T06:17:56Z) - An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios.<n>Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z) - Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries [2.696054049278301]
We introduce DeBinVul, a novel decompiled binary code vulnerability dataset.
We fine-tune state-of-the-art LLMs using DeBinVul and report on a performance increase of 19%, 24%, and 21% in detecting binary code vulnerabilities.
arXiv Detail & Related papers (2024-11-07T18:54:31Z) - HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that many LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes.
arXiv Detail & Related papers (2024-09-10T12:01:43Z) - How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.