Fugu-MT Paper Translation (Abstract): Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Paper overview: Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

arxiv url: http://arxiv.org/abs/2509.01444v1
Date: Mon, 01 Sep 2025 12:58:43 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.704617
Title: Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions
Title (reference translation): Strata-Sword: A Hierarchical Safety Evaluation of LLMs Based on the Reasoning Complexity of Jailbreak Instructions
Authors: Shiji Zhao, Ranjie Duan, Jiexi Liu, Xiaojun Jia, Fengxiang Wang, Cheng Wei, Ruoxi Cheng, Yong Xie, Chang Liu, Qing Guo, Jialing Tao, Hui Xue, Xingxing Wei
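Abstract summary: Large language models (LLMs) and large reasoning models (LRMs) face potential safety risks under jailbreak attacks. This paper first quantifies "reasoning complexity" as an evaluable safety dimension and categorizes 15 jailbreak attack methods into three levels according to their reasoning complexity. To fully exploit unique language characteristics, it also proposes several Chinese jailbreak attack methods, including the Chinese Character Disassembly attack, the Lantern Riddle attack, and the Acrostic Poem attack.
Reference score (independently calculated attention score): 46.429936395155515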
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have gained widespread recognition for their superior comprehension and have been deployed across numerous domains. Building on the Chain-of-Thought (CoT) paradigm, Large Reasoning Models (LRMs) further exhibit strong reasoning skills, enabling them to infer user intent more accurately and respond appropriately. However, both LLMs and LRMs face potential safety risks under jailbreak attacks, which raises concerns about their safety capabilities. Current safety evaluation methods often focus on content dimensions or simply aggregate different attack methods, without considering the complexity of the instructions. In fact, instructions of different complexity reflect different safety capabilities of the model: simple instructions reflect the model's basic values, while complex instructions reflect its ability to deal with deeper safety risks. Therefore, a comprehensive benchmark is needed to evaluate the model's safety performance on instructions of varying complexity, which can provide a better understanding of the safety boundaries of LLMs. To this end, this paper first quantifies "Reasoning Complexity" as an evaluable safety dimension and categorizes 15 jailbreak attack methods into three levels according to their reasoning complexity, establishing a hierarchical Chinese-English jailbreak safety benchmark for systematically evaluating the safety performance of LLMs. Meanwhile, to fully utilize unique language characteristics, we first propose several Chinese jailbreak attack methods, including the Chinese Character Disassembly attack, Lantern Riddle attack, and Acrostic Poem attack. A series of experiments indicates that current LLMs and LRMs show different safety boundaries under different reasoning complexity, which provides a new perspective for developing safer LLMs and LRMs.
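Abstract (reference translation): Large language models (LLMs) have been widely recognized for their strong comprehension and have been deployed across many domains. Large Reasoning Models (LRMs), built on the Chain-of-Thought (CoT) approach, show even stronger reasoning skills, allowing them to infer user intent more accurately and respond appropriately. However, LLMs and LRMs face potential safety risks from jailbreak attacks, which raises concerns about their safety capabilities. Current safety evaluation methods often focus on content dimensions, or simply aggregate different attack methods without considering complexity. Simple instructions reflect the model's basic values, while complex instructions reflect the model's ability to handle deeper safety risks. A comprehensive benchmark is therefore needed to evaluate the model's safety performance when it faces instructions of varying complexity, so that the safety boundaries of LLMs can be better understood. To this end, this paper first quantifies "reasoning complexity" as an evaluable safety dimension, categorizes 15 jailbreak attack methods into three levels according to their reasoning complexity, and builds a hierarchical Chinese-English jailbreak safety benchmark for systematically evaluating the safety performance of LLMs. In addition, to fully exploit unique language characteristics, it proposes several Chinese jailbreak attack methods, including the Chinese Character Disassembly attack, the Lantern Riddle attack, and the Acrostic Poem attack. A series of experiments shows that current LLMs and LRMs exhibit different safety boundaries under different levels of reasoning complexity, which offers a new perspective for developing safer LLMs and LRMs.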

Related papers