Fugu-MT paper translation (abstract): Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Paper overview: Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

arxiv url: http://arxiv.org/abs/2605.29659v1
Date: Thu, 28 May 2026 09:21:42 GMT
Status: translation complete
Last updated in system: 2026-05-30 02:45:56.130392
Title: Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
Title (reference translation): Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
Authors: Ihor Stepanov, Aleksandr Smechov
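Abstract summary: Opir is a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization.
Reference score (independently computed attention score): 46.13517417540154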
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive on or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.
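Abstract (reference translation): Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models. This paper introduces Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. Edge variants with fewer than 100M parameters, dedicated to binary safe/unsafe classification, are also released. The models are trained on a three-level taxonomy of 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. The accompanying open-source evaluation harness supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. In an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems, including both GLiNER2-based and generative guardrail models, Opir variants match or exceed the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.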
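
The taxonomy sizes quoted in the abstract can be sanity-checked with a few lines of Python. This is only a minimal sketch: the dictionary layout and the names `TAXONOMY_LEVEL_COUNTS`, `REPORTED_TOTAL`, and `total_categories` are illustrative assumptions, not part of the Opir release; only the four numbers come from the abstract.

```python
# Label counts per taxonomy level, as stated in the abstract.
# The dictionary layout and all names here are hypothetical, for illustration only.
TAXONOMY_LEVEL_COUNTS = {
    "top": 16,
    "mid": 126,
    "leaf": 854,
}

REPORTED_TOTAL = 996  # total number of categories stated in the abstract


def total_categories(level_counts: dict[str, int]) -> int:
    """Sum the label counts over all taxonomy levels."""
    return sum(level_counts.values())


if __name__ == "__main__":
    total = total_categories(TAXONOMY_LEVEL_COUNTS)
    # Report whether the per-level counts add up to the stated total.
    print(total, total == REPORTED_TOTAL)
```

The three per-level counts sum to the reported total, so the 996 figure counts labels at every level of the taxonomy rather than leaf labels alone.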


Related papers