Fugu-MT Paper Translation (Abstract): A Neural Space-Time Representation for Text-to-Image Personalization

Paper summary: A Neural Space-Time Representation for Text-to-Image Personalization

arxiv url: http://arxiv.org/abs/2305.15391v1
Date: Wed, 24 May 2023 17:53:07 GMT
Status: Translation complete
Last updated in system: 2023-05-25 13:41:44.329947
Title: A Neural Space-Time Representation for Text-to-Image Personalization
Authors: Yuval Alaluf, Elad Richardson, Gal Metzer, Daniel Cohen-Or
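Abstract summary: A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This paper explores a new text-conditioning space that depends on both the denoising process timestep (time) and the denoising U-Net layers (space). A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, which makes this space challenging to optimize directly.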
Reference score (independently computed attention score): 46.772764467280986
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
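The mapper idea in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the layer sizes, the `NeuralMapper` class, the `encode_with_bypass` helper, the input normalization, and the counts of timesteps and U-Net layers are all hypothetical choices made for illustration. It only shows the shape of the idea: a small network takes a (timestep, layer) pair and returns a token embedding plus a residual that is added to the text encoder output (the textual bypass).

```python
import math
import random

# All sizes below are hypothetical toy values, not taken from the paper.
EMBED_DIM = 8        # real text-encoder embeddings are far larger
HIDDEN_DIM = 16      # width of the mapper's hidden layer
NUM_TIMESTEPS = 50   # assumed number of denoising steps
NUM_LAYERS = 16      # assumed number of conditioned U-Net layers


def init_linear(rng, n_in, n_out):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(params, x):
    """Apply a dense layer: weights @ x + biases."""
    weights, biases = params
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


class NeuralMapper:
    """Toy mapper: (timestep, layer) -> (token embedding, bypass residual)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.hidden = init_linear(rng, 2, HIDDEN_DIM)
        self.to_token = init_linear(rng, HIDDEN_DIM, EMBED_DIM)
        self.to_bypass = init_linear(rng, HIDDEN_DIM, EMBED_DIM)

    def __call__(self, timestep, layer):
        # Normalize both inputs to roughly [0, 1] before the hidden layer.
        x = [timestep / NUM_TIMESTEPS, layer / NUM_LAYERS]
        h = [max(0.0, v) for v in linear(self.hidden, x)]  # ReLU
        return linear(self.to_token, h), linear(self.to_bypass, h)

    def num_parameters(self):
        layers = (self.hidden, self.to_token, self.to_bypass)
        return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def encode_with_bypass(text_encoder, token_embedding, bypass):
    """Textual bypass: add the residual to the text encoder's output."""
    encoded = text_encoder(token_embedding)
    return [e + r for e, r in zip(encoded, bypass)]


def identity_encoder(embedding):
    """Hypothetical stand-in for a real text encoder."""
    return list(embedding)


mapper = NeuralMapper(seed=0)
token, bypass = mapper(timestep=10, layer=3)
conditioning = encode_with_bypass(identity_encoder, token, bypass)

# Storing one vector per (time, space) pair explicitly, versus the mapper.
explicit_values = NUM_TIMESTEPS * NUM_LAYERS * EMBED_DIM
mapper_values = mapper.num_parameters()
```

With these toy sizes, an explicit table would hold one vector for each of the 800 (timestep, layer) pairs, while the mapper stores only its own weights. In a real setting the mapper would be trained with the diffusion objective and its token output would pass through an actual text encoder; neither step is shown here.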
