Fugu-MT Paper Translation (Abstract): Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Paper overview: Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

arxiv url: http://arxiv.org/abs/2604.08846v1
Date: Fri, 10 Apr 2026 01:01:56 GMT
Status: translation complete
Last updated in system: 2026-04-13 17:57:53.621423
Title: Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Title (reference translation): Safeguarding Multimodal LLMs with Dictionary-Aligned Concept Control
Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal
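Abstract summary: This paper introduces Dictionary-Aligned Concept Control (DACO), a framework that uses a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, a dictionary of 15,000 multimodal concepts is curated by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. Second, the curated dictionary is shown to support intervening on activations via sparse coding. Third, a new steering approach is proposed that uses the dictionary to initialize SAE training and to automatically annotate the semantics of the SAE atoms for safeguarding MLLMs.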
Reference score (independently computed attention score): 89.07972282630351
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
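Abstract (reference translation): Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that elicit unsafe responses. Recent work improves MLLM safety through prompt engineering, response classification, or fine-tuning. However, such approaches are often ineffective against evolving malicious patterns, and may require rerunning the query or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. Yet existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts, or struggle to adjust a specific concept without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene on activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and to automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.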
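
The abstract's second step, intervening on activations via sparse coding over a concept dictionary, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not state the solver, the layer, or how atoms are selected, so the greedy orthogonal matching pursuit below, the function names `omp` and `steer`, the toy orthonormal dictionary `D`, and the choice of atom 3 as the "unsafe" concept are all assumptions made for illustration.

```python
import numpy as np


def omp(D, x, k):
    """Greedy orthogonal matching pursuit (an assumed solver, not the paper's).

    D: (d, n) dictionary with unit-norm columns (concept directions).
    x: (d,) activation vector. Returns a length-n coefficient vector
    with at most k nonzero entries.
    """
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit all selected atoms jointly, then update the residual.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef


def steer(D, x, blocked, k=4):
    """Subtract the contribution of the atoms in `blocked` from x."""
    coef = omp(D, x, k)
    return x - D[:, blocked] @ coef[blocked], coef


# Toy setup (hypothetical): 5 orthonormal concept directions in 8 dimensions.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
D = Q[:, :5]

# An activation mixing a benign concept (atom 0) and an unsafe one (atom 3).
x = 2.0 * D[:, 0] + 3.0 * D[:, 3]
steered, coef = steer(D, x, blocked=[3])
```

In this toy case the atoms are orthonormal, so the sparse code recovers the two mixed concepts exactly and removing atom 3 leaves the benign component untouched. With a real, correlated dictionary of thousands of concepts, recovery is only approximate and the sparsity level `k` becomes a tuning choice.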

Related papers