Related papers: iCodeReviewer: Improving Secure Code Review with Mixture of Prompts

iCodeReviewer: Improving Secure Code Review with Mixture of Prompts

URL: http://arxiv.org/abs/2510.12186v1
Date: Tue, 14 Oct 2025 06:30:59 GMT
Title: iCodeReviewer: Improving Secure Code Review with Mixture of Prompts
Authors: Yun Peng, Kisub Kim, Linghan Meng, Kui Liu,
Abstract summary: iCodeReviewer is an automated secure code review approach based on large language models (LLMs)<n>Experiment results demonstrate the effectiveness of iCodeReviewer in security issue identification and localization with an F1 of 63.98%.<n>The review comments generated by iCodeReviewer also achieve a high acceptance rate up to 84% when it is deployed in production environments.
Score: 5.322602557660654
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code review is an essential process to ensure the quality of software that identifies potential software issues at an early stage of software development. Among all software issues, security issues are the most important to identify, as they can easily lead to severe software crashes and service disruptions. Recent research efforts have been devoted to automated approaches to reduce the manual efforts required in the secure code review process. Despite the progress, current automated approaches on secure code review, including static analysis, deep learning models, and prompting approaches, still face the challenges of limited precision and coverage, and a lack of comprehensive evaluation. To mitigate these challenges, we propose iCodeReviewer, which is an automated secure code review approach based on large language models (LLMs). iCodeReviewer leverages a novel mixture-of-prompts architecture that incorporates many prompt experts to improve the coverage of security issues. Each prompt expert is a dynamic prompt pipeline to check the existence of a specific security issue. iCodeReviewer also implements an effective routing algorithm to activate only necessary prompt experts based on the code features in the input program, reducing the false positives induced by LLM hallucination. Experiment results in our internal dataset demonstrate the effectiveness of iCodeReviewer in security issue identification and localization with an F1 of 63.98%. The review comments generated by iCodeReviewer also achieve a high acceptance rate up to 84% when it is deployed in production environments.

Related papers

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning [8.229920162000369]
We propose SecureReviewer to identify and resolve security-related issues during code review.<n>We first construct a dataset tailored for training and evaluating secure code review capabilities.<n>We integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge.
arXiv Detail & Related papers (2025-10-30T13:06:11Z)
CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention [68.95008546581339]
Existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality.<n>We propose CARE, a novel framework for decoding-time safety alignment that integrates three key components.<n>The framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience.
arXiv Detail & Related papers (2025-09-01T04:50:02Z)
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code.<n>Current large language models (LLMs) still struggle with secure coding.<n>A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective [19.345433857645016]
CoV-Eval is a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification.<n> VC-Judge is an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities.
arXiv Detail & Related papers (2025-05-15T16:53:41Z)
CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation [20.72188827088484]
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing.<n> detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge.<n>We introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs.
arXiv Detail & Related papers (2025-01-14T15:27:01Z)
RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation. RedCode-Exec provides challenging prompts that could lead to risky code execution. RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions.
arXiv Detail & Related papers (2024-11-12T13:30:06Z)
INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output codes.
arXiv Detail & Related papers (2024-06-23T15:55:07Z)
Toward Effective Secure Code Reviews: An Empirical Study of Security-Related Coding Weaknesses [14.134803943492345]
We conducted an empirical case study in two large open-source projects, OpenSSL and PHP. Based on 135,560 code review comments, we found that reviewers raised security concerns in 35 out of 40 coding weakness categories. Some coding weaknesses related to past vulnerabilities, such as memory errors and resource management, were discussed less often than the vulnerabilities.
arXiv Detail & Related papers (2023-11-28T00:49:00Z)
Security Defect Detection via Code Review: A Study of the OpenStack and Qt Communities [7.2944322548786715]
Security defects are not prevalently discussed in code review. More than half of the reviewers provided explicit fixing strategies/solutions to help developers fix security defects. Disagreement between the developer and the reviewer are the main causes for not resolving security defects.
arXiv Detail & Related papers (2023-07-05T14:30:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.