Fugu-MT Paper Translation (Abstract): A Creative Agent is Worth a 64-Token Template

Paper abstract: A Creative Agent is Worth a 64-Token Template

arxiv url: http://arxiv.org/abs/2603.17895v1
Date: Wed, 18 Mar 2026 16:25:52 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.820833
Title: A Creative Agent is Worth a 64-Token Template
Title (reference translation): A Creative Agent is Worth a 64-Token Template
Authors: Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
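Abstract summary: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains dependent on discrete natural-language prompts. We introduce CAT, a framework for Creative Agent Tokenization.
Reference score (independently computed attention level): 31.988429473627594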
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes "creativity" costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents' intrinsic understanding of "creativity" through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
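Abstract (reference translation): Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains dependent on discrete natural-language prompts. When given fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to humans. Recent reasoning- and agent-driven approaches augment prompts iteratively, but because their instance-specific generation makes "creativity" expensive and non-reusable, requiring repeated queries or reasoning, their computational and monetary costs are high. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates an agent's intrinsic understanding of "creativity". Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be concatenated directly with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, which uses relations among partially overlapping concept pairs to capture the agent's latent creative representations. CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost while producing images with better human preference and text-image alignment than state-of-the-art T2I models and creative generation methods.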
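
The abstract describes a tokenizer that maps fuzzy-prompt embeddings to a reusable token template, which is then concatenated with those embeddings before they reach the T2I model. The following is a minimal sketch of that data flow only, not the paper's implementation. The function names, the toy shapes, the prompt-first concatenation order, and the placeholder `creative_tokenizer` (a simple pooling rule standing in for the trained model) are all assumptions made for illustration. Only the template length of 64 comes from the title. Plain Python lists are used so the example runs without dependencies.

```python
# Hypothetical sketch: every name and shape here is an assumption,
# not something specified in the abstract.
from typing import List

Vector = List[float]
TokenSeq = List[Vector]  # one embedding vector per token

TEMPLATE_LEN = 64  # template length taken from the paper's title


def mean_pool(prompt_emb: TokenSeq) -> Vector:
    """Average the token embeddings into a single vector."""
    dim = len(prompt_emb[0])
    n = len(prompt_emb)
    return [sum(tok[d] for tok in prompt_emb) / n for d in range(dim)]


def creative_tokenizer(prompt_emb: TokenSeq, num_tokens: int = TEMPLATE_LEN) -> TokenSeq:
    """Placeholder for the learned Creative Tokenizer.

    The real tokenizer is a trained model. Here each template slot is just the
    pooled prompt vector scaled by a slot-dependent factor, so the code runs.
    """
    pooled = mean_pool(prompt_emb)
    return [[x * (1.0 + i / num_tokens) for x in pooled] for i in range(num_tokens)]


def inject(prompt_emb: TokenSeq, template: TokenSeq) -> TokenSeq:
    """Concatenate prompt embeddings and template along the token axis.

    Placing the prompt first is an assumption; the abstract does not give the order.
    """
    return prompt_emb + template


# Toy prompt: 3 tokens, embedding dimension 4.
prompt_emb = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]

# The template is computed once ...
template = creative_tokenizer(prompt_emb)

# ... and can be reused for later generations without further agent calls.
conditioning = inject(prompt_emb, template)
print(len(template), len(conditioning))  # 64 67
```

In this sketch the only per-generation work is the concatenation in `inject`, which mirrors the abstract's claim that the template is reused without repeated reasoning or prompt augmentation. A practical version would use a tensor library and pass the result to the T2I model as its text conditioning.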

Related papers