Fugu-MT Paper Translation (Abstract): A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Paper overview: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

arxiv url: http://arxiv.org/abs/2509.14297v1
Date: Wed, 17 Sep 2025 04:21:20 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:52.92477
Title: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Title (reference translation): A Simple and Efficient Jailbreak Method for LLMs
Authors: Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu
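Abstract summary: Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. This paper introduces HILL, a novel jailbreak approach that transforms imperative harmful requests into learning-style questions. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness.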
Reference score (independently computed attention measure): 32.47621091096285
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions with only straightforward hypotheticality indicators. Further, we introduce two new metrics to thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. Moreover, the assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting a critical challenge of balancing helpfulness and safety alignments.
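Abstract (reference translation): Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. This paper introduces HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions using only simple hypotheticality indicators. It also proposes two new metrics for thoroughly evaluating the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves the top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Results for various defense methods show the robustness of HILL: most defenses have only mediocre effects, and some even increase the attack success rates. Moreover, the assessment on the authors' constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting the critical challenge of balancing helpfulness and safety alignment.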


Related papers