SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning
- URL: http://arxiv.org/abs/2510.26457v1
- Date: Thu, 30 Oct 2025 13:06:11 GMT
- Title: SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning
- Authors: Fang Liu, Simiao Liu, Yinghao Zhu, Xiaoli Lian, Li Zhang
- Abstract summary: We propose SecureReviewer to identify and resolve security-related issues during code review. We first construct a dataset tailored for training and evaluating secure code review capabilities. We integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge.
- Score: 8.229920162000369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review; their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety [59.01189713115365]
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness. We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z) - SeRe: A Security-Related Code Review Dataset Aligned with Real-World Review Activities [8.215547096412346]
Existing datasets and studies primarily focus on general-purpose code review comments. We introduce SeRe, a security-related code review dataset, constructed using an active learning-based ensemble classification approach. We extracted 6,732 security-related reviews from 373,824 raw review instances, ensuring representativeness across multiple programming languages.
arXiv Detail & Related papers (2026-01-03T02:39:53Z) - iCodeReviewer: Improving Secure Code Review with Mixture of Prompts [5.322602557660654]
iCodeReviewer is an automated secure code review approach based on large language models (LLMs). Experiment results demonstrate the effectiveness of iCodeReviewer in security issue identification and localization with an F1 of 63.98%. The review comments generated by iCodeReviewer also achieve a high acceptance rate of up to 84% when it is deployed in production environments.
arXiv Detail & Related papers (2025-10-14T06:30:59Z) - A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code. Current large language models (LLMs) still struggle with secure coding. A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z) - The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [57.1838332916627]
Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP). Their widespread deployment has also raised significant safety concerns. LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts.
arXiv Detail & Related papers (2025-06-06T05:50:50Z) - SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code [7.209766132478914]
We introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Through the empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code.
arXiv Detail & Related papers (2025-06-06T02:48:02Z) - Improving Automated Secure Code Reviews: A Synthetic Dataset for Code Vulnerability Flaws [0.0]
We propose the creation of a synthetic dataset consisting of vulnerability-focused reviews that specifically comment on security flaws. Our approach leverages Large Language Models (LLMs) to generate human-like code review comments for vulnerabilities.
arXiv Detail & Related papers (2025-04-22T23:07:24Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation [20.72188827088484]
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing. Detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. We introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs.
arXiv Detail & Related papers (2025-01-14T15:27:01Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)