Fugu-MT Paper Translation (Summary): Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

Paper summary: Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

arxiv url: http://arxiv.org/abs/2606.06875v1
Date: Fri, 05 Jun 2026 03:43:04 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.549896
Title: Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows
Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang
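Abstract summary: Unified Visual Safety Regulator (UVR) is a training-free safe generation framework that regulates unsafe semantics in generated images. UVR mitigates unsafe generation through unified attention modulation and explicit restriction of harmful information flow. UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively.
Reference score (independently computed attention measure): 20.386952794426833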
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance by achieving 91% and 77% erase rate in image synthesis and editing tasks, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.
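The abstract describes two steps: localizing the output patches that carry unsafe semantics, then restricting the information flow into those patches. The sketch below illustrates that general idea on a toy joint-attention matrix. It is not the UVR implementation. The function names, the threshold, the toy token vectors, and the choice of hard-masking attention logits are all assumptions made for illustration, and the sketch ignores the denoising stages the abstract distinguishes.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries of -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, blocked=frozenset()):
    """Scaled dot-product attention weights over one joint token sequence.

    `blocked` holds (query_index, key_index) pairs whose logits are set to
    -inf, so no information flows along those edges (hypothetical stand-in
    for the paper's flow restriction).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    rows = []
    for qi, q in enumerate(queries):
        logits = []
        for ki, k in enumerate(keys):
            if (qi, ki) in blocked:
                logits.append(-math.inf)
            else:
                logits.append(scale * sum(a * b for a, b in zip(q, k)))
        rows.append(softmax(logits))
    return rows


def localize_unsafe_patches(weights, patch_tokens, concept_tokens, threshold):
    """Flag patches whose attention mass on concept tokens exceeds threshold."""
    return [
        p for p in patch_tokens
        if sum(weights[p][c] for c in concept_tokens) > threshold
    ]


# Toy joint sequence (all values are made up):
# tokens 0-1 are text tokens, tokens 2-3 are output image patches.
tokens = [
    [1.0, 0.0],  # 0: benign text token
    [0.0, 2.0],  # 1: text token for the concept to erase
    [2.0, 0.0],  # 2: patch aligned with the benign token
    [0.0, 2.0],  # 3: patch aligned with the concept token
]
CONCEPT_TOKENS = [1]
PATCH_TOKENS = [2, 3]
THRESHOLD = 0.3  # assumed cutoff, not a value from the paper

# Step 1: localize unsafe output patches from unrestricted attention.
weights = attention_weights(tokens, tokens)
unsafe = localize_unsafe_patches(weights, PATCH_TOKENS, CONCEPT_TOKENS, THRESHOLD)

# Step 2: block concept -> unsafe-patch edges and recompute attention.
blocked = {(p, c) for p in unsafe for c in CONCEPT_TOKENS}
restricted = attention_weights(tokens, tokens, blocked)

print("unsafe patches:", unsafe)
print("patch 3 attention before:", [round(w, 3) for w in weights[3]])
print("patch 3 attention after: ", [round(w, 3) for w in restricted[3]])
```

In this toy setup only the patch aligned with the concept token is flagged, so the benign patch keeps its original attention row. A softer variant would scale the logits down instead of setting them to negative infinity, which trades erasure strength for fidelity.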

Related papers