Fugu-MT Paper Translation (Abstract): SwordBench: Evaluating Orthogonality of Steering Image Representations

Paper overview: SwordBench: Evaluating Orthogonality of Steering Image Representations

arxiv url: http://arxiv.org/abs/2605.16372v1
Date: Sun, 10 May 2026 14:45:52 GMT
Status: Translation complete
Last updated in system: 2026-05-19 23:51:08.300357
Title: SwordBench: Evaluating Orthogonality of Steering Image Representations
Title (reference translation): SwordBench: Evaluating the Orthogonality of Steering Image Representations
Authors: Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek
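Abstract summary: SwordBench is a benchmark for steering image representations of vision models. Cross-concept robustness measures the stability of concept detection performance. Collateral damage quantifies whether steering inadvertently affects model performance on a downstream task.
Reference score (independently computed attention score): 15.251435211656206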
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
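Abstract (reference translation): Steering or intervening on model representations at inference time is essential for AI interpretability and safety, but existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. In addition to a unified benchmarking suite, we propose new evaluation notions that reveal the second-order effects of orthogonalization among concept activation vectors for practical steering. In particular, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs that lack the bias. A linear support vector machine shows superior separability and orthogonality, but it cannot reduce collateral damage to zero and often trails sparse autoencoders. In simpler regimes, neither standard baselines nor optimization-based methods achieve perfect steering. The source code is scheduled to be released on GitHub soon.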
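
The two evaluation notions named in the abstract can be illustrated with a small sketch. The abstract does not give the exact definitions, so everything below is an assumption for illustration only: steering is modeled as projecting a representation onto the subspace orthogonal to a single concept activation vector, performance is plain accuracy, and the names `orthogonalize`, `cross_concept_robustness`, and `collateral_damage` are hypothetical rather than taken from the SwordBench code.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Predictor = Callable[[Sequence[float]], int]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(x: Sequence[float], cav: Sequence[float]) -> Vector:
    """Remove the component of x that lies along the concept direction cav."""
    norm_sq = dot(cav, cav)
    if norm_sq == 0.0:
        raise ValueError("concept vector must be non-zero")
    scale = dot(x, cav) / norm_sq
    return [xi - scale * ci for xi, ci in zip(x, cav)]


def accuracy(predict: Predictor, inputs: Sequence[Sequence[float]],
             labels: Sequence[int]) -> float:
    """Fraction of inputs for which the predictor matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def cross_concept_robustness(detect: Predictor,
                             inputs: Sequence[Sequence[float]],
                             labels: Sequence[int],
                             other_cav: Sequence[float]) -> float:
    """Assumed proxy: accuracy of a concept detector on inputs that were
    orthogonalized against a *different* concept's activation vector."""
    steered = [orthogonalize(x, other_cav) for x in inputs]
    return accuracy(detect, steered, labels)


def collateral_damage(predict: Predictor,
                      bias_free_inputs: Sequence[Sequence[float]],
                      labels: Sequence[int],
                      cav: Sequence[float]) -> float:
    """Assumed proxy: downstream accuracy drop on inputs lacking the bias
    after steering them anyway (0.0 means no damage)."""
    steered = [orthogonalize(x, cav) for x in bias_free_inputs]
    return (accuracy(predict, bias_free_inputs, labels)
            - accuracy(predict, steered, labels))
```

In this toy setting, a downstream classifier that relies on the removed direction loses accuracy after steering, which is the kind of side effect the collateral damage notion is meant to expose.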

Related papers