Related papers: Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

URL: http://arxiv.org/abs/2601.00213v1
Date: Thu, 01 Jan 2026 05:14:32 GMT
Title: Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
Abstract summary: This study investigates the safety of large language models (LLMs) in automated algorithm design. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak. We reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts.
Score: 27.520381454182147
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

Related papers

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking [23.54890959996959]
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools. This accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software. We propose SPELL, a comprehensive testing framework specifically designed to evaluate the weakness of security alignment in malicious code generation.
arXiv Detail & Related papers (2025-12-24T15:25:31Z)
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation [94.61617176929384]
OmniSafeBench-MM is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. It integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research.
arXiv Detail & Related papers (2025-12-06T22:56:29Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks [29.963044242980345]
Jailbreak attacks pose a serious threat to the safety of Large Language Models. We propose SafeLLM, a novel unlearning-based defense framework. We show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.
arXiv Detail & Related papers (2025-08-21T02:39:14Z)
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing [13.267217024192535]
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs). We introduce GuardVal, a new evaluation protocol that generates and refines jailbreak prompts based on the defender LLM's state. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains.
arXiv Detail & Related papers (2025-07-10T13:15:20Z)
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges [70.85114705489222]
We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code generation. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
arXiv Detail & Related papers (2025-06-09T12:02:39Z)
Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses [0.5261718469769449]
Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has sparked concerns, especially through jailbreak attacks that bypass safety measures to produce harmful content. We present a comprehensive security analysis of large language models (LLMs), addressing critical research questions on the evolution and determinants of model safety.
arXiv Detail & Related papers (2025-04-02T19:33:07Z)
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation [10.987263424166477]
Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs). In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. We identify four key factors: model size, model architecture, training datasets and training techniques.
arXiv Detail & Related papers (2025-03-09T08:47:16Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.