Fugu-MT Paper Translation (Abstract): SecureBreak -- A dataset towards safe and secure models

Paper overview: SecureBreak -- A dataset towards safe and secure models

arxiv url: http://arxiv.org/abs/2603.21975v1
Date: Mon, 23 Mar 2026 13:41:05 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.690985
Title: SecureBreak -- A dataset towards safe and secure models
Title (reference translation): SecureBreak -- A dataset for safe and secure models
Authors: Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera
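Abstract summary: This paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs. The dataset is highly reliable owing to careful manual annotation, in which labels are assigned conservatively to ensure safety. In tests with pre-trained LLMs, results improved after fine-tuning on SecureBreak.
Reference score (independently computed attention score): 3.797867929356259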
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ``ultimate'' defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.
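Abstract (reference translation): Large language models are becoming widespread core components in many real-world applications. As a result, security alignment is a critical requirement for safe deployment. Previous related work has focused mainly on model architectures and alignment methodologies, but these approaches alone cannot completely eliminate harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks such as jailbreaking and prompt injection can bypass existing security alignment mechanisms. Consequently, further security strategies are needed, both to provide qualitative feedback on the robustness of the security alignment obtained at the training stage and to create an 'ultimate' defense layer that blocks unsafe outputs that deployed models may generate. As a contribution to this scenario, the paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by weaknesses in security alignment. The dataset is highly reliable owing to careful manual annotation, in which labels are assigned conservatively to ensure safety. It performs well at detecting unsafe content across multiple risk categories. In tests with pre-trained LLMs, results improved after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.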

Related papers