Fugu-MT Paper Translation (Abstract): CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

Paper overview: CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

arxiv url: http://arxiv.org/abs/2606.15396v1
Date: Sat, 13 Jun 2026 16:57:51 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:33.548484
Title: CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment
Title (reference translation): CHILLGuard: Toward a Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-Aware Preference Alignment
Authors: Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang
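Abstract summary: Malicious content generated by large language models (LLMs) can pose severe safety risks and ethical concerns. Existing LLM safety guardrails excel in English and multilingual settings, but they are not adapted to Chinese-specific regulatory policies, cultural context, and linguistic nuances. We introduce a fine-grained risk taxonomy with 5 macro and 31 micro categories for Chinese scenarios and build CHILLGuard, a dedicated Chinese LLM content safety guardrail.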
Reference score (independently computed attention score): 55.74660714572696
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.
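Abstract (reference translation): Malicious content generated by large language models (LLMs) can pose severe safety risks and ethical concerns. Existing LLM safety guardrails excel in English and multilingual settings, but they lack adaptation to Chinese-specific regulatory policies, cultural context, and linguistic nuances, and they do not support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a fine-grained risk taxonomy with 5 macro and 31 micro categories for Chinese scenarios and build CHILLGuard, a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline that expands a multi-source corpus via retrieval-augmented generation, generates implicit harmful samples through prompt-engineering rewriting, and refines high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement in F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.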
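
The abstract names multi-model voting-based label calibration but does not specify the procedure. The following is a minimal sketch of one plausible reading, simple majority voting with an agreement threshold. The function name `calibrate_label`, the `min_agreement` parameter, the label strings, and the sample votes are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def calibrate_label(votes: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the majority label if its vote share exceeds min_agreement, else None.

    This is an illustrative assumption, not the paper's actual calibration rule.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Hypothetical votes from three annotator models for two samples.
samples = {
    "sample_a": ["unsafe", "unsafe", "safe"],
    "sample_b": ["safe", "unsafe", "controversial"],
}

# Keep only the samples on which the annotator models sufficiently agree.
calibrated = {sid: calibrate_label(v) for sid, v in samples.items()}
kept = {sid: lab for sid, lab in calibrated.items() if lab is not None}
```

In this sketch, samples without a clear majority are dropped rather than assigned a label, which trades dataset size for label quality; the paper may resolve disagreements differently.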

Related papers