Fugu-MT Paper Translation (Summary): Many-Turn Jailbreaking

Paper summary: Many-Turn Jailbreaking

arxiv url: http://arxiv.org/abs/2508.06755v1
Date: Sat, 09 Aug 2025 00:02:39 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.5332
Title: Many-Turn Jailbreaking
Authors: Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
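Abstract summary: We study multi-turn jailbreaking, in which jailbroken LLMs are continuously tested on more than a single target query. To benchmark this setting on a series of open- and closed-source models, we construct the Multi-Turn Jailbreak Benchmark (MTJ-Bench).
Reference score (independently computed attention score): 65.04921693379944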
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it focuses only on single-turn jailbreaking that targets one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. We therefore propose exploring multi-turn jailbreaking, in which jailbroken LLMs are continuously tested beyond the first-turn conversation or a single target query. This is an even more serious threat because 1) users commonly continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) the initial round of jailbreaking may also cause the LLMs to respond consistently to additional irrelevant questions. As a first step in exploring multi-turn jailbreaking (first draft completed in June 2024), we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

Related papers