Related papers: LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges

URL: http://arxiv.org/abs/2506.10022v1
Date: Mon, 09 Jun 2025 12:02:39 GMT
Title: LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
Authors: Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li,
Abstract summary: We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation.<n>M MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories.<n>Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
Score: 70.85114705489222
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To fill this gap, we propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation, designed to evaluate LLM robustness against such threats. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities: specifically, the average rejection rate for malicious content is 60.93%, dropping to 39.92% when combined with jailbreak attack algorithms. Our work highlights that the code security capabilities of LLMs still pose significant challenges.

Related papers

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation [22.75124155879712]
Large language models (LLMs) remain vulnerable to jailbreak attacks.<n>We propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection JBShield-D and mitigation JBShield-M.
arXiv Detail & Related papers (2025-02-11T13:50:50Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner. LLMs are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z)
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [24.69275959735538]
This paper investigates whether techniques that successfully jailbreak Large Language Models can be equally effective in jailbreaking MLLMs. We introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs. We generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks.
arXiv Detail & Related papers (2024-04-03T19:23:18Z)
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
arXiv Detail & Related papers (2024-02-14T06:54:31Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [50.22128133926407]
We conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies. We identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4.
arXiv Detail & Related papers (2023-08-07T16:55:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.