MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
- URL: http://arxiv.org/abs/2507.19598v1
- Date: Fri, 25 Jul 2025 18:11:10 GMT
- Title: MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
- Authors: Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, Ismini Lourentzou
- Abstract summary: We introduce code decomposition attacks, where a malicious coding task is broken down into seemingly benign subtasks to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Fine-tuning on MOCHA improves rejection rates while preserving coding ability and, importantly, enhances robustness on external adversarial datasets, with up to a 32.4% increase in rejection rates without any additional supervision.
- Score: 12.213189431386478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability and, importantly, enhances robustness on external adversarial datasets, with up to a 32.4% increase in rejection rates without any additional supervision.
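To make the idea of a code decomposition attack and the rejection-rate metric concrete, here is a minimal sketch (not the authors' released evaluation code) of how one might score a model on multi-turn decomposed prompts. The `query_model` interface, the refusal keyword heuristic, and the dataset format are illustrative assumptions, not artifacts from MOCHA.

```python
# Illustrative sketch of a multi-turn "code decomposition" evaluation loop.
# query_model() is a hypothetical stand-in for any chat-completion API;
# the refusal heuristic and data format are NOT taken from the MOCHA benchmark.

from typing import Callable, Dict, List

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to assist")


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic; the benchmark's actual judging may differ."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def multi_turn_rejection(
    query_model: Callable[[List[Dict[str, str]]], str],
    subtasks: List[str],
) -> bool:
    """Feed seemingly benign subtasks one conversational turn at a time.

    Returns True if the model refuses at any point before the decomposed
    (malicious) task has been fully assembled across turns.
    """
    messages: List[Dict[str, str]] = []
    for subtask in subtasks:
        messages.append({"role": "user", "content": subtask})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if is_refusal(reply):
            return True
    return False


def rejection_rate(
    query_model: Callable[[List[Dict[str, str]]], str],
    dataset: List[List[str]],
) -> float:
    """Fraction of decomposed prompt sequences the model rejects."""
    rejected = sum(multi_turn_rejection(query_model, seq) for seq in dataset)
    return rejected / max(len(dataset), 1)
```

A single-turn baseline would simply concatenate each subtask list into one prompt; comparing the two rejection rates mirrors, in spirit, the single-turn versus multi-turn comparison described in the abstract.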
Related papers
- Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks [0.5218155982819203]
Large Language Models (LLMs) are increasingly used as code assistants.
This study examines a more direct threat: open-source LLMs generating vulnerable code when prompted.
arXiv Detail & Related papers (2025-07-14T08:36:26Z)
- Adversarial Attack Classification and Robustness Testing for Large Language Models for Code [19.47426054151291]
This study investigates how adversarial perturbations in natural language inputs affect Large Language Models for Code (LLM4Code).
It examines the effects of perturbations at the character, word, and sentence levels to identify the most impactful vulnerabilities.
arXiv Detail & Related papers (2025-06-09T17:02:29Z)
- Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning.
Do these models truly reason, or do they merely exploit shallow statistical patterns?
We introduce Chain-of-Code Collapse, investigating the robustness of reasoning LLMs with a suite of semantically faithful yet adversarially structured prompt perturbations.
arXiv Detail & Related papers (2025-06-08T02:43:46Z)
- CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations [36.60702578561009]
Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains insufficiently explored.
We present CodeCrash, a comprehensive stress-testing benchmark comprising 1,279 questions from two established datasets.
We systematically evaluate 17 LLMs across input and output prediction tasks using direct and Chain-of-Thought prompting approaches.
arXiv Detail & Related papers (2025-04-19T00:40:28Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.
It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.
We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation [65.92001420372007]
This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks.
We introduce GaslightingBench, the first benchmark specifically designed to evaluate the vulnerability of MLLMs to negation arguments.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
- ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs [56.46702494338318]
This paper introduces a new attack paradigm: (automatic) external prompt injection against code-oriented large language models.
We propose ShadowCode, a simple yet effective method that automatically generates induced perturbations based on code simulation.
We evaluate our method across 13 distinct malicious objectives, generating 31 threat cases spanning three popular programming languages.
arXiv Detail & Related papers (2024-07-12T10:59:32Z)
- An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection [17.948513691133037]
We introduce CodeBreaker, a pioneering LLM-assisted backdoor attack framework on code completion models.
By integrating malicious payloads directly into the source code with minimal transformation, CodeBreaker challenges current security measures.
arXiv Detail & Related papers (2024-06-10T22:10:05Z)
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code languages.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Transfer Attacks and Defenses for Large Language Models on Coding Tasks [30.065641782962974]
We study the effect of adversarial perturbations on coding tasks with large language models (LLMs).
We propose prompt-based defenses that involve modifying the prompt to include examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations.
Our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the LLMs' performance.
arXiv Detail & Related papers (2023-11-22T15:11:35Z)
- Contrastive Decoding Improves Reasoning in Large Language Models [55.16503283583076]
We show that Contrastive Decoding achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks.
We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
arXiv Detail & Related papers (2023-09-17T00:29:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.