Detect Repair Verify for Securing LLM Generated Code: A Multi-Language Empirical Study
- URL: http://arxiv.org/abs/2603.00897v1
- Date: Sun, 01 Mar 2026 03:41:24 GMT
- Title: Detect Repair Verify for Securing LLM Generated Code: A Multi-Language Empirical Study
- Authors: Cheng Cheng
- Abstract summary: Security is often addressed through a Detect--Repair--Verify (DRV) loop that detects issues, applies fixes, and verifies the result. This work studies such a workflow for project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness beyond studying detection or repair alone; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification.
- Score: 10.18490328199727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are increasingly used to produce runnable software. In practice, security is often addressed through a Detect--Repair--Verify (DRV) loop that detects issues, applies fixes, and verifies the result. This work studies such a workflow for project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness beyond studying detection or repair alone; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification. A new benchmark dataset (https://github.com/Hahappyppy2024/EmpricalVDR) is introduced, consisting of runnable web-application projects paired with functional tests and targeted security tests, and supporting three prompt granularities at the project, requirement, and function level. The evaluation compares generation-only, single-pass DRV, and bounded iterative DRV variants under comparable budget constraints. Outcomes are measured by secure-and-correct yield using test-grounded verification, and intermediate artifacts are analyzed to assess report actionability and post-repair failure modes such as regressions, semantic drift, and newly introduced security issues.
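The bounded iterative DRV variant described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `detect`, `repair`, and `verify` callables, the `drv_loop` function, and the toy "insecure" marker are all hypothetical stand-ins for, respectively, a detection tool, an LLM-based fixer, and the test-grounded verification step (functional plus security tests).

```python
# Minimal sketch of a bounded iterative Detect-Repair-Verify (DRV) loop.
# All names here are illustrative assumptions, not the paper's actual code.

def drv_loop(code, detect, repair, verify, max_iters=3):
    """Repeat detect -> repair until verification passes or the budget runs out.

    verify(code) stands in for test-grounded verification (functional and
    security tests); detect(code) returns a report of findings; repair applies
    a fix guided by that report. max_iters models the comparable-budget bound.
    """
    for _ in range(max_iters):
        if verify(code):          # already secure and correct: count toward yield
            return code, True
        report = detect(code)     # detection report used as repair guidance
        if not report:            # nothing actionable reported; stop early
            break
        code = repair(code, report)
    return code, verify(code)     # final verdict after exhausting the budget

# Toy usage: the substring "insecure" plays the role of a detected security issue.
detect = lambda c: ["finding"] if "insecure" in c else []
repair = lambda c, report: c.replace("insecure", "secure")
verify = lambda c: "insecure" not in c

fixed, ok = drv_loop("insecure_query(user_input)", detect, repair, verify)
```

Setting `max_iters=1` recovers the single-pass DRV variant, and skipping the loop entirely corresponds to the generation-only baseline, which is how the three compared configurations differ under the same budget.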
Related papers
- TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation [19.43198506241428]
We present TestExplora, a benchmark designed to evaluate Large Language Models as proactive testers. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%.
arXiv Detail & Related papers (2026-02-11T03:22:51Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review [54.141490756509306]
We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
arXiv Detail & Related papers (2026-01-07T04:26:12Z)
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems [48.971606069204825]
DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python [0.0]
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python.
arXiv Detail & Related papers (2025-08-22T14:30:24Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents [17.658431034176065]
FaultLine is an agent workflow that automatically generates proof-of-vulnerability (PoV) test cases. It does not use language-specific static or dynamic analysis components, which enables it to be used across programming languages. On a dataset of 100 known vulnerabilities in Java, C, and C++ projects, FaultLine is able to generate PoV tests for 16 projects, compared to just 9 for CodeAct 2.1.
arXiv Detail & Related papers (2025-07-21T04:55:34Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- Code Change Intention, Development Artifact and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM [13.278153690972243]
Existing approaches such as VulFixMiner and CoLeFunDa focus solely on code changes, neglecting essential context from development artifacts. We propose LLM4VFD, a novel framework that leverages Large Language Models (LLMs) enhanced with Chain-of-Thought reasoning and In-Context Learning.
arXiv Detail & Related papers (2025-01-24T23:40:03Z)
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards [43.86118338226387]
We introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS).
BELLS is a structured collection of tests, organized into three categories: established failure tests, emerging failure tests and next-gen architecture tests.
We implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualization of the dataset.
arXiv Detail & Related papers (2024-06-03T14:32:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.