StriderSPD: Structure-Guided Joint Representation Learning for Binary Security Patch Detection
- URL: http://arxiv.org/abs/2601.05772v1
- Date: Fri, 09 Jan 2026 12:55:29 GMT
- Title: StriderSPD: Structure-Guided Joint Representation Learning for Binary Security Patch Detection
- Authors: Qingyuan Li, Chenchen Yu, Chuanyi Li, Xin-Cheng Wen, Cheryl Lee, Cuiyun Gao, Bin Luo
- Abstract summary: Security Patch Detection (SPD) protects software assets. Most SPD studies have targeted Open-Source Software (OSS), yet a large portion of real-world software is closed-source. We propose StriderSPD, a framework for binary code that integrates a graph branch into a large language model.
- Score: 22.120085662911194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vulnerabilities severely threaten software systems, making the timely application of security patches crucial for mitigating attacks. However, software vendors often silently patch vulnerabilities with limited disclosure, which is where Security Patch Detection (SPD) steps in to protect software assets. Most recent SPD studies have targeted Open-Source Software (OSS), yet a large portion of real-world software is closed-source, where patches are distributed as binaries without accessible source code. The few existing binary SPD approaches typically lift binaries to higher abstraction levels, i.e., assembly code or pseudo-code. However, assembly code consists of register-based instructions conveying limited semantics, while pseudo-code lacks a parser-compatible grammar from which structure can be extracted; both hinder accurate vulnerability-fix representation learning. In addition, previous studies often draw training and testing data from the same project, which fails to reflect closed-source conditions. To alleviate these challenges, we propose StriderSPD, a Structure-guided joint representation SPD framework for binary code that integrates a graph branch into a large language model (LLM), leveraging structural information to guide the LLM in identifying security patches. Our novel adapter design in the graph branch effectively aligns the representations of assembly code and pseudo-code at the LLM's token level. We further present a two-stage training strategy that addresses the optimization imbalance caused by the large parameter disparity between StriderSPD's two branches, enabling both branches to fit properly. To enable more realistic evaluation, we construct a binary SPD benchmark that is disjoint from prior datasets in both projects and domains, and we extensively evaluate StriderSPD on this benchmark.
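The abstract's two-stage training strategy can be pictured as a staged freezing schedule over the two branches. The following is a minimal, hypothetical sketch (not the authors' released code): branch names and parameter counts are illustrative assumptions, and only the stage scheduling logic is shown.

```python
# Hypothetical sketch of a two-stage training schedule for two branches
# with a large parameter disparity, as the abstract describes.
# The Branch class and the sizes below are illustrative assumptions.

class Branch:
    def __init__(self, name, n_params):
        self.name = name
        self.n_params = n_params
        self.trainable = True

def two_stage_schedule(llm_branch, graph_branch):
    """Return the trainable-flag configuration for each training stage.

    Stage 1 freezes the large LLM branch so the much smaller graph
    branch can fit first; Stage 2 unfreezes both for joint fine-tuning.
    """
    stages = []
    llm_branch.trainable, graph_branch.trainable = False, True   # stage 1
    stages.append({b.name: b.trainable for b in (llm_branch, graph_branch)})
    llm_branch.trainable = graph_branch.trainable = True         # stage 2
    stages.append({b.name: b.trainable for b in (llm_branch, graph_branch)})
    return stages

llm = Branch("llm", 7_000_000_000)   # assumed LLM-branch size
graph = Branch("graph", 5_000_000)   # assumed graph-branch size
schedule = two_stage_schedule(llm, graph)
print(schedule)
# -> [{'llm': False, 'graph': True}, {'llm': True, 'graph': True}]
```

In a real framework the trainable flags would translate to freezing parameter groups in the optimizer, so the small branch is not swamped by gradients flowing through billions of LLM parameters during early training.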
Related papers
- Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model [60.60587869092729]
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation.
arXiv Detail & Related papers (2026-02-07T07:42:07Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - Binary Diff Summarization using Large Language Models [17.877160310535942]
Large language models (LLMs) have been applied to binary analysis to augment traditional tools. We propose a novel framework for binary diff summarization using LLMs. We create a software supply chain security benchmark by injecting 3 different malware samples into 6 open-source projects.
arXiv Detail & Related papers (2025-09-28T16:47:24Z) - Empirical Study of Code Large Language Models for Binary Security Patch Detection [12.110226735365643]
Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks. In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code. However, these approaches cannot be applied to closed-source applications and proprietary systems that constitute a significant portion of real-world software.
arXiv Detail & Related papers (2025-09-07T13:31:43Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries [4.1417640577742425]
Vul-BinLLM is a framework for binary vulnerability detection using Large Language Models. Vul-BinLLM mirrors traditional binary analysis with fine-grained optimizations in decompilation and vulnerability reasoning with an extended context. Our evaluations show that Vul-BinLLM is highly effective in detecting vulnerabilities on the compiled Juliet dataset.
arXiv Detail & Related papers (2025-05-28T06:17:56Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - Repository-Level Graph Representation Learning for Enhanced Security Patch Detection [22.039868029497942]
This paper proposes a Repository-level Security Patch Detection framework named RepoSPD. RepoSPD comprises three key components: 1) a repository-level graph construction, RepoCPG, which represents software patches by merging pre-patch and post-patch source code at the repository level; 2) a structure-aware patch representation, which fuses the graph and sequence branches and aims at comprehending the relationship among multiple code changes; and 3) progressive learning, which facilitates the model in balancing semantic and structural information.
arXiv Detail & Related papers (2024-12-11T03:29:56Z) - HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z) - BinGo: Identifying Security Patches in Binary Code with Graph Representation Learning [19.22004583230725]
We propose BinGo, a new security patch detection system for binary code.
BinGo consists of four phases, namely, patch data pre-processing, graph extraction, embedding generation, and graph representation learning.
Our experimental results show BinGo can achieve up to 80.77% accuracy in identifying security patches between two neighboring versions of binary code.
arXiv Detail & Related papers (2023-12-13T06:35:39Z) - Just-in-Time Detection of Silent Security Patches [7.840762542485285]
Security patches can be silent, i.e., they do not always come with comprehensive advisories such as CVEs.
This lack of transparency leaves users oblivious to available security updates, providing ample opportunity for attackers to exploit unpatched vulnerabilities.
We propose to leverage large language models (LLMs) to augment patch information with generated code change explanations.
arXiv Detail & Related papers (2023-12-02T22:53:26Z) - Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.