Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities
- URL: http://arxiv.org/abs/2509.25894v1
- Date: Tue, 30 Sep 2025 07:38:57 GMT
- Title: Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities
- Authors: Simin Chen, Yixin He, Suman Jana, Baishakhi Ray
- Abstract summary: We propose SWExploit, which generates adversarial issue statements to make APR agents produce patches that are functionally correct yet vulnerable. Based on our evaluation, we are the first to challenge the traditional assumption that a patch passing all tests is inherently reliable and secure.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-based agents are increasingly deployed for software maintenance tasks such as automated program repair (APR). APR agents automatically fetch GitHub issues and use backend LLMs to generate patches that fix the reported bugs. However, existing work primarily focuses on the functional correctness of APR-generated patches, i.e., whether they pass hidden or regression tests, while largely ignoring potential security risks. Given the openness of platforms like GitHub, where any user can raise issues and participate in discussions, an important question arises: Can an adversarial user submit a valid issue on GitHub that misleads an LLM-based agent into generating a functionally correct but vulnerable patch? To answer this question, we propose SWExploit, which generates adversarial issue statements designed to make APR agents produce patches that are functionally correct yet vulnerable. SWExploit operates in three main steps: (1) program analysis to identify potential injection points for vulnerable payloads; (2) adversarial issue generation to provide misleading reproduction and error information while preserving the original issue semantics; and (3) iterative refinement of the adversarial issue statements based on the outputs of the APR agents. Empirical evaluation on three agent pipelines and five backend LLMs shows that SWExploit can produce patches that are both functionally correct and vulnerable (the attack success rate on correct patches can reach 0.91, whereas the baseline ASRs are all below 0.20). Based on our evaluation, we are the first to challenge the traditional assumption that a patch passing all tests is inherently reliable and secure, highlighting critical limitations in the current evaluation paradigm for APR agents.
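To make the three-step pipeline concrete, here is a minimal Python sketch of the attack loop as the abstract describes it. Every name and signature below (find_injection_points, make_adversarial_issue, refine_issue, and the apr_agent/is_vulnerable callables) is an assumption for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    title: str
    body: str

def find_injection_points(repo_path: str) -> list[str]:
    """Step 1 (assumed): program analysis that locates code sites where a
    vulnerable payload could plausibly be introduced by a patch."""
    return ["utils/parser.py:42"]  # placeholder result

def make_adversarial_issue(original: Issue, injection_point: str) -> Issue:
    """Step 2 (assumed): add misleading reproduction/error information that
    steers the agent toward the injection point while preserving the
    original issue semantics."""
    hint = f"\nThe stack trace points at {injection_point}."
    return Issue(original.title, original.body + hint)

def refine_issue(issue: Issue, patch: str) -> Issue:
    """Step 3 (assumed): adjust the issue text based on the agent's last
    patch, strengthening the misleading signal."""
    return Issue(issue.title, issue.body + "\n(The repro still fails on main.)")

def swexploit(original: Issue, repo: str, apr_agent, is_vulnerable, max_iters=5):
    """Drive the APR agent until it emits a patch that passes tests yet
    contains the planted vulnerability, or the budget runs out."""
    issue = make_adversarial_issue(original, find_injection_points(repo)[0])
    for _ in range(max_iters):
        patch = apr_agent(issue)   # backend LLM generates a candidate patch
        if is_vulnerable(patch):   # functionally correct but vulnerable
            return issue, patch
        issue = refine_issue(issue, patch)
    return None
```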
Related papers
- DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems
DoVer is an intervention-driven debugging framework for large language model (LLM)-based multi-agent systems. It augments hypothesis generation with active verification through targeted interventions. DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses.
arXiv Detail & Related papers (2025-12-07T09:23:48Z)
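A minimal sketch of the hypothesize-intervene-verify loop the summary describes; the callables (hypotheses, intervene, rerun) are hypothetical stand-ins, not the paper's API.

```python
def dover_debug(trial, hypotheses, intervene, rerun):
    """For each hypothesized root cause, apply a targeted intervention
    (e.g. editing one agent's message) and rerun the trial to validate
    or refute the hypothesis."""
    validated, refuted = [], []
    for h in hypotheses(trial):
        patched_trial = intervene(trial, h)  # targeted intervention
        if rerun(patched_trial).success:     # failed trial flipped to success
            validated.append(h)
        else:
            refuted.append(h)
    return validated, refuted
```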
- REFINE: Enhancing Program Repair Agents through Context-Aware Patch Refinement
Large Language Models (LLMs) have recently shown strong potential in automatic program repair (APR). However, LLMs often struggle to produce correct fixes due to limited understanding of code context and over-reliance on incomplete test suites. We propose a novel patch refinement framework, Refine, that systematically transforms draft patches into correct ones.
arXiv Detail & Related papers (2025-10-04T00:34:32Z)
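A hedged sketch of a context-aware draft-patch refinement loop in this spirit; the helpers (llm, run_tests) and the feedback format are assumptions.

```python
def refine_patch(draft: str, repo: str, llm, run_tests, max_rounds: int = 3):
    """Iteratively enrich a draft patch with test feedback and code context
    until the suite passes or the round budget is exhausted."""
    patch = draft
    for _ in range(max_rounds):
        report = run_tests(repo, patch)
        if report.passed:
            return patch
        context = f"Failing tests: {report.failures}"  # context-aware signal
        patch = llm(f"Revise this patch.\n{context}\n{patch}")
    return patch  # best effort after the budget is spent
```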
- Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry. Showing unlikely patches to developers can lead to substantial noise. We introduce two complementary policies to reduce such noise: bug abstention and patch validation policies.
arXiv Detail & Related papers (2025-10-03T17:53:28Z)
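An illustrative sketch of the two policies as gates around a repair agent; the dual-judge structure and prompt wording here are assumptions, not the paper's design.

```python
def repair_with_policies(bug_report, abstain_llm, repair_agent, validate_llm):
    # Policy 1 (abstention): skip bugs the agent is unlikely to fix well.
    if abstain_llm(f"Should we attempt this bug?\n{bug_report}") == "abstain":
        return None  # show nothing to the developer, avoiding noise
    patch = repair_agent(bug_report)
    # Policy 2 (validation): check the patch before surfacing it.
    if validate_llm(f"Is this patch plausible?\n{patch}") != "accept":
        return None
    return patch
```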
- Where LLM Agents Fail and How They can Learn From Failures
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. Yet they amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
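A toy rendering of such a modular taxonomy as a data structure; the five module names follow the summary, while the keyword-matching classifier is purely illustrative.

```python
from enum import Enum

class AgentModule(Enum):
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"

def classify_failure(trace: str) -> AgentModule:
    """Placeholder classifier: route an error trace to the first module
    whose name it mentions (a real system would use a trained model)."""
    for module in AgentModule:
        if module.value in trace.lower():
            return module
    return AgentModule.SYSTEM  # default to system-level operations
```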
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
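A hedged sketch of hypothesis-then-validate detection across several analysis views; the view list and the two agent callables are assumptions for illustration.

```python
VIEWS = ["memory-safety", "input-validation", "auth-logic"]  # assumed views

def vulagent(code: str, hypothesize, validate):
    findings = []
    for view in VIEWS:                        # one perspective per detector
        for hyp in hypothesize(code, view):   # candidate vulnerabilities
            if validate(code, hyp):           # a second agent confirms/refutes
                findings.append((view, hyp))
    return findings  # validation prunes false positives before reporting
```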
- Adversarial Bug Reports as a Security Risk in Language Model-Based Automated Program Repair
Automated Program Repair (APR) systems are increasingly integrated into modern software development. In this paper, we investigate the security risks posed by adversarial bug reports. We develop a comprehensive threat model and conduct an empirical study to evaluate the vulnerability of state-of-the-art APR systems to such attacks.
arXiv Detail & Related papers (2025-09-04T09:41:57Z)
- Is Your Automated Software Engineer Trustworthy?
Large Language Models (LLMs) are increasingly used in software engineering tasks. LLMs respond to every issue and generate a patch for every case, even when the input is vague or their own output is incorrect. This leads to unreliable behaviour, such as hallucinated code changes or responses based on vague issue reports. We introduce BouncerBench, a benchmark that evaluates whether LLM-based software agents can refuse to act when inputs are ill-defined.
arXiv Detail & Related papers (2025-06-21T20:56:20Z)
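A sketch of the refusal behavior BouncerBench measures: a wrapper that declines under-specified issues instead of always emitting a patch. The specificity heuristic below is invented for illustration, not the benchmark's scoring rule.

```python
def bouncer(issue: str, agent, min_detail: int = 2):
    """Refuse to act unless the issue carries enough concrete detail."""
    signals = ["steps to reproduce", "expected", "actual", "traceback"]
    detail = sum(s in issue.lower() for s in signals)  # crude specificity score
    if detail < min_detail:
        return {"action": "refuse", "reason": "issue is under-specified"}
    return {"action": "patch", "patch": agent(issue)}
```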
- Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Malicious attackers can inject backdoors into these models by embedding triggers into harmful question-answer (QA) pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive. We develop and evaluate three automated failure attribution methods, summarizing their pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
- Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning
Large language models (LLMs) are increasingly integrated into real-world personalized applications. The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. We propose a method for "harmless" copyright protection of knowledge bases.
arXiv Detail & Related papers (2025-02-10T09:15:56Z)
- A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback
We present VRpilot, a vulnerability repair technique based on reasoning and patch validation feedback.
Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively.
arXiv Detail & Related papers (2024-05-24T16:29:48Z)
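A minimal sketch of the reason-then-validate loop attributed to VRpilot above; the prompt wording and the compile_and_test helper are assumptions, not the authors' artifacts.

```python
def vrpilot(vuln_report: str, llm, compile_and_test, max_iters: int = 4):
    """Repeat: reason about the vulnerability, emit a patch, and feed the
    validation result back into the next prompt."""
    feedback = ""
    for _ in range(max_iters):
        # Chain-of-thought style reasoning before the candidate patch.
        patch = llm(f"Reason step by step, then fix:\n{vuln_report}\n{feedback}")
        result = compile_and_test(patch)
        if result.ok:
            return patch
        feedback = f"Validation failed: {result.errors}"  # fed back to the LLM
    return None
```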
- A Unified Debugging Approach via LLM-Based Multi-Agent Synergy
FixAgent is an end-to-end framework for unified debugging through multi-agent synergy.
It significantly outperforms state-of-the-art repair methods, fixing 1.25x to 2.56x as many bugs on the repo-level benchmark Defects4J.
arXiv Detail & Related papers (2024-04-26T04:55:35Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen). RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages: the TFix benchmark in JavaScript, and the Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
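A toy sketch of retrieval-augmented patch generation in the RAP-Gen style; the token-overlap retriever below stands in for the paper's retriever, and all names are illustrative.

```python
def rap_gen(buggy_code: str, fix_corpus: list[tuple[str, str]], generator, k=3):
    """Retrieve the k most similar past bug-fix pairs, then prompt the
    generator with them as exemplars before the new bug."""
    def score(pair):
        bug, _ = pair
        return len(set(buggy_code.split()) & set(bug.split()))  # token overlap
    exemplars = sorted(fix_corpus, key=score, reverse=True)[:k]
    prompt = "".join(f"BUG: {b}\nFIX: {f}\n" for b, f in exemplars)
    return generator(prompt + f"BUG: {buggy_code}\nFIX:")
```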
- Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT
Automated Program Repair (APR) aims to automatically generate patches for buggy programs. Recent APR work has focused on leveraging modern Large Language Models (LLMs) to directly generate patches. We propose ChatRepair, the first fully automated conversation-driven APR approach.
arXiv Detail & Related papers (2023-04-01T20:57:33Z)
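A hedged sketch of conversation-driven repair as summarized for ChatRepair: keep the dialogue going, feeding each test failure back as the next turn. The chat_llm and run_tests interfaces are assumptions.

```python
def chat_repair(bug: str, chat_llm, run_tests, max_turns: int = 5):
    """Iteratively repair via a running conversation with the LLM."""
    history = [f"Fix this bug:\n{bug}"]
    for _ in range(max_turns):
        patch = chat_llm(history)   # next patch conditioned on full dialogue
        report = run_tests(patch)
        if report.passed:
            return patch            # plausible patch found
        history.append(f"Patch failed: {report.failures}. Try again.")
    return None
```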