Fugu-MT Paper Translation (Abstract): Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

Paper summary: Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

arxiv url: http://arxiv.org/abs/2509.26238v1
Date: Tue, 30 Sep 2025 13:32:59 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.145795
Title: Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
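Title (reference translation): Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Authors: James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
Abstract summary: Traditional safety monitors often require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term by term.
Reference score (independently computed attention score): 67.15793594651609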
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
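Abstract (reference translation): Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can stop early for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.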
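
The term-by-term evaluation and the adaptive cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes a degree-2 polynomial over an activation vector whose quadratic term is a sum of rank-1 products, and an early-exit rule based on the magnitude of the running logit. The names `make_terms`, `partial_logits`, `cascade`, and `margin` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Term = Callable[[Vector], float]


def dot(a: Vector, b: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def make_terms(
    bias: float,
    w1: Vector,
    w2_factors: Sequence[Tuple[Vector, Vector]],
) -> List[Term]:
    """Build the polynomial terms, lowest order first.

    Term 1 is an ordinary linear probe: bias + w1 . x
    Term 2 is a quadratic correction written as a sum of rank-1
    products (u . x)(v . x). This parametrization is an assumption
    made for the sketch, not taken from the paper.
    """
    def linear(x: Vector) -> float:
        return bias + dot(w1, x)

    def quadratic(x: Vector) -> float:
        return sum(dot(u, x) * dot(v, x) for u, v in w2_factors)

    return [linear, quadratic]


def partial_logits(terms: Sequence[Term], x: Vector) -> List[float]:
    """Safety-dial view: the logit after evaluating 1, 2, ... terms."""
    total = 0.0
    out = []
    for term in terms:
        total += term(x)
        out.append(total)
    return out


def cascade(terms: Sequence[Term], x: Vector, margin: float) -> Tuple[float, int]:
    """Adaptive-cascade view: stop as soon as the running logit is
    at least `margin` away from zero (a hypothetical confidence rule).

    Returns the final logit and the number of terms evaluated.
    """
    logit = 0.0
    used = 0
    for term in terms:
        logit += term(x)
        used += 1
        if abs(logit) >= margin:
            break
    return logit, used


if __name__ == "__main__":
    terms = make_terms(
        bias=0.0,
        w1=[1.0, -1.0],
        w2_factors=[([1.0, 0.0], [0.0, 1.0])],
    )
    # Clear case: the linear term alone exceeds the margin.
    print(cascade(terms, [3.0, 0.0], margin=2.0))
    # Ambiguous case: the linear term is zero, so the quadratic term runs too.
    print(cascade(terms, [1.0, 1.0], margin=2.0))
```

In this sketch, truncating `partial_logits` at a chosen order plays the role of the safety dial, while `cascade` spends the higher-order computation only on inputs that the lower-order terms leave close to the decision boundary.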
