Fugu-MT Paper Translation (Abstract): HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Paper overview: HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arxiv url: http://arxiv.org/abs/2604.19274v1
Date: Tue, 21 Apr 2026 09:41:20 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.706349
Title: HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
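Title (reference translation): HarDBench: A Benchmark of Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Authors: Euntae Kim, Soomin Han, Buru Chang
Abstract summary: Large language models (LLMs) are widely used as co-authors in collaborative writing. Malicious users may fill incomplete drafts with dangerous content to jailbreak the models into generating harmful outputs. HarDBench is a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat.
Reference score (independently computed attention metric): 7.088503833248158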
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content, to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
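Abstract (reference translation): Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users start from rough drafts and rely on LLMs to complete, revise, and refine their content. However, malicious users could fill incomplete drafts with dangerous content to jailbreak the models into generating harmful outputs. This paper identifies the vulnerability of LLMs to such draft-based co-authoring jailbreak attacks and introduces HarDBench, a systematic benchmark for evaluating the robustness of LLMs against this new threat. HarDBench spans a range of high-risk domains, including explosives, drugs, weapons, and cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that the alignment method significantly reduces harmful outputs without degrading co-authoring capability. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. The new benchmark and dataset are available on the project page at https://github.com/untae0122/HarDBench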

List of related papers