Detect Repair Verify for Securing LLM Generated Code: A Multi-Language Empirical Study
- URL: http://arxiv.org/abs/2603.00897v1
- Date: Sun, 01 Mar 2026 03:41:24 GMT
- Title: Detect Repair Verify for Securing LLM Generated Code: A Multi-Language Empirical Study
- Authors: Cheng Cheng
- Abstract summary: Security is often addressed through a Detect--Repair--Verify (DRV) loop that detects issues, applies fixes, and verifies the result. This work studies such a workflow for project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness beyond studying detection or repair alone; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification.
- Score: 10.18490328199727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are increasingly used to produce runnable software. In practice, security is often addressed through a Detect--Repair--Verify (DRV) loop that detects issues, applies fixes, and verifies the result. This work studies such a workflow for project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness beyond studying detection or repair alone; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification. A new benchmark dataset (https://github.com/Hahappyppy2024/EmpricalVDR) is introduced, consisting of runnable web-application projects paired with functional tests and targeted security tests, and supporting three prompt granularities at the project, requirement, and function level. The evaluation compares generation-only, single-pass DRV, and bounded iterative DRV variants under comparable budget constraints. Outcomes are measured by secure-and-correct yield using test-grounded verification, and intermediate artifacts are analyzed to assess report actionability and post-repair failure modes such as regressions, semantic drift, and newly introduced security issues.
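The bounded iterative DRV variant described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `detect`, `repair`, and `verify` callables, the `drv_loop` function, and the toy "insecure" marker are all hypothetical stand-ins for, respectively, a detection tool, an LLM-based fixer, and the test-grounded verification step (functional plus security tests).

```python
# Minimal sketch of a bounded iterative Detect-Repair-Verify (DRV) loop.
# All names here are illustrative assumptions, not the paper's actual code.

def drv_loop(code, detect, repair, verify, max_iters=3):
    """Repeat detect -> repair until verification passes or the budget runs out.

    verify(code) stands in for test-grounded verification (functional and
    security tests); detect(code) returns a report of findings; repair applies
    a fix guided by that report. max_iters models the comparable-budget bound.
    """
    for _ in range(max_iters):
        if verify(code):          # already secure and correct: count toward yield
            return code, True
        report = detect(code)     # detection report used as repair guidance
        if not report:            # nothing actionable reported; stop early
            break
        code = repair(code, report)
    return code, verify(code)     # final verdict after exhausting the budget

# Toy usage: the substring "insecure" plays the role of a detected security issue.
detect = lambda c: ["finding"] if "insecure" in c else []
repair = lambda c, report: c.replace("insecure", "secure")
verify = lambda c: "insecure" not in c

fixed, ok = drv_loop("insecure_query(user_input)", detect, repair, verify)
```

Setting `max_iters=1` recovers the single-pass DRV variant, and skipping the loop entirely corresponds to the generation-only baseline, which is how the three compared configurations differ under the same budget.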
Related papers
- TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation [19.43198506241428]
We present TestExplora, a benchmark designed to evaluate Large Language Models as proactive testers. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%.
arXiv Detail & Related papers (2026-02-11T03:22:51Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review [54.141490756509306]
We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
arXiv Detail & Related papers (2026-01-07T04:26:12Z)
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python [0.0]
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python.
arXiv Detail & Related papers (2025-08-22T14:30:24Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents [17.658431034176065]
FaultLine is an agent workflow that automatically generates proof-of-vulnerability (PoV) test cases. It does not use language-specific static or dynamic analysis components, which enables it to be used across programming languages. On a dataset of 100 known vulnerabilities in Java, C, and C++ projects, FaultLine is able to generate PoV tests for 16 projects, compared to just 9 for CodeAct 2.1.
arXiv Detail & Related papers (2025-07-21T04:55:34Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- Code Change Intention, Development Artifact and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM [13.278153690972243]
Existing approaches such as VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. We propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning.
arXiv Detail & Related papers (2025-01-24T23:40:03Z)
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards [43.86118338226387]
We introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS).
BELLS is a structured collection of tests, organized into three categories: established failure tests, emerging failure tests and next-gen architecture tests.
We implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualization of the dataset.
arXiv Detail & Related papers (2024-06-03T14:32:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.