Fugu-MT Paper Translation (Abstract): Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

Paper abstract: Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

arxiv url: http://arxiv.org/abs/2605.02908v1
Date: Mon, 06 Apr 2026 13:04:37 GMT
Status: Translation complete
Last updated in system: 2026-05-11 06:56:26.506582
Title: Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings
Authors: Bumjun Kim, Albert No
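Abstract summary: In memorized cases, Stable Diffusion relies little on the prompt embeddings and heavily on the padding embeddings, which structurally duplicate the <endoftext> embedding. The paper proposes two inference-time mitigations: replacing the tokenizer's default <pad> with the ! token while masking the <endoftext> embedding, and partially masking the padding embeddings.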
Reference score (independently computed attention score): 9.014348389153913
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext> and <pad> with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We discover that $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) Replacing the tokenizer's default <pad> from <eot> to the ! token before embedding, and masking the $\mathbf{v}^{\mathbf{eot}}$; (2) Partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
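The two mitigation strategies in the abstract can be sketched on plain token-id lists. This is a minimal illustration, not the authors' implementation. It assumes the usual CLIP layout in Stable Diffusion (77 positions, `<startoftext>` id 49406, `<endoftext>` id 49407 doubling as the default `<pad>`, and `!` at id 0). It also assumes that "masking" means zeroing an embedding row, and that partial masking keeps the first `keep_pad` padding positions. The function names, `keep_pad`, and the example prompt ids are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed values for the CLIP tokenizer used by Stable Diffusion v1.
SOT_ID = 49406   # <startoftext>
EOT_ID = 49407   # <endoftext>, also the tokenizer's default <pad>
BANG_ID = 0      # the "!" token
MAX_LEN = 77     # context length


def build_ids(prompt_ids: Sequence[int], pad_id: int = EOT_ID) -> List[int]:
    """Lay out <sot>, prompt tokens, <eot>, then padding up to MAX_LEN."""
    body = list(prompt_ids)[: MAX_LEN - 2]
    ids = [SOT_ID] + body + [EOT_ID]
    return ids + [pad_id] * (MAX_LEN - len(ids))


def roles(n_prompt: int) -> List[str]:
    """Label each position as sot / pr / eot / pad, as in the abstract."""
    n = min(n_prompt, MAX_LEN - 2)
    return ["sot"] + ["pr"] * n + ["eot"] + ["pad"] * (MAX_LEN - n - 2)


def strategy_one(prompt_ids: Sequence[int]) -> Tuple[List[int], List[float]]:
    """(1) Pad with "!" instead of <eot>, and mask the single v^eot position."""
    ids = build_ids(prompt_ids, pad_id=BANG_ID)
    keep = [0.0 if r == "eot" else 1.0 for r in roles(len(prompt_ids))]
    return ids, keep


def strategy_two(prompt_ids: Sequence[int],
                 keep_pad: int = 8) -> Tuple[List[int], List[float]]:
    """(2) Keep default padding, but mask all pads after the first keep_pad.

    Which pad positions to mask is an assumption made for this sketch.
    """
    ids = build_ids(prompt_ids)
    keep, seen_pad = [], 0
    for r in roles(len(prompt_ids)):
        if r == "pad":
            seen_pad += 1
            keep.append(1.0 if seen_pad <= keep_pad else 0.0)
        else:
            keep.append(1.0)
    return ids, keep


def apply_mask(embeddings: Sequence[Sequence[float]],
               keep: Sequence[float]) -> List[List[float]]:
    """Zero out masked rows of a (MAX_LEN x dim) embedding matrix."""
    return [[x * k for x in row] for row, k in zip(embeddings, keep)]


# Hypothetical two-token prompt; the ids are placeholders, not real words.
ids1, keep1 = strategy_one([320, 1125])
ids2, keep2 = strategy_two([320, 1125], keep_pad=8)
```

In a real pipeline the `keep` vector would multiply the text encoder's output before it is passed to the denoiser as conditioning. The pure-Python version here only shows which positions each strategy touches.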
