Fugu-MT Paper Translation (Summary): Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

Paper summary: Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

arxiv url: http://arxiv.org/abs/2604.07655v1
Date: Wed, 08 Apr 2026 23:47:29 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:05.599395
Title: Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Title (reference translation): Guardian-as-an-Advisor: Improving Next-Generation Guardian Models for Trustworthy LLMs
Authors: Yue Huang, Haomin Zhuang, Jiayi Ye, Han Bao, Yanbo Wang, Hang Hua, Siyuan Wu, Pin-Yu Chen, Xiangliang Zhang
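Abstract summary: Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label and prepends this advice to the original query for re-inference. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
Reference score (attention score computed independently by this site): 70.81495077853673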
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline where a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
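Abstract (reference translation): Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label, gives a concise explanation, and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, GuardSet is constructed as a multi-domain dataset of more than 208k examples, unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT, followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow. A latency study shows that advisor inference uses below 5% of base-model compute and adds only 2-10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.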
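
The soft-gating flow described in the abstract (guardian predicts a label and an explanation, the advice is prepended to the query, and the base model re-infers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Advice`, `toy_guardian`, `augment`, `soft_gate`, the keyword list, and the advice text format are all hypothetical, and the keyword check merely stands in for a trained guardian such as GuardAdvisor. The abstract does not say whether advice is attached to every query or only to flagged ones; this sketch attaches it to every query.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Advice:
    harmful: bool      # binary risk label
    explanation: str   # concise rationale for the label


def toy_guardian(query: str) -> Advice:
    """Hypothetical stand-in for a guardian model (a keyword check only)."""
    flagged = [w for w in ("explosive", "malware") if w in query.lower()]
    if flagged:
        return Advice(True, f"Query mentions {', '.join(flagged)}.")
    return Advice(False, "No risk indicators found.")


def augment(query: str, advice: Advice) -> str:
    """Prepend the advice to the original query (the format is an assumption)."""
    label = "harmful" if advice.harmful else "harmless"
    header = f"[Guardian advice] label={label}; reason={advice.explanation}"
    return f"{header}\n\n{query}"


def soft_gate(
    query: str,
    guardian: Callable[[str], Advice],
    base_model: Callable[[str], str],
) -> str:
    """Soft gating: the guardian advises, and the base model still answers."""
    advice = guardian(query)
    return base_model(augment(query, advice))
```

In contrast to a hard gate, nothing here blocks the request: `base_model` (any callable from prompt to response) always receives the query and decides how to respond under its own spec, with the guardian's advice as extra context.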


Related papers