Fugu-MT paper translation (abstract): UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Paper summary: UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

arxiv url: http://arxiv.org/abs/2606.04264v1
Date: Tue, 02 Jun 2026 22:30:46 GMT
Status: translation complete
Last updated in system: 2026-06-04 20:44:18.407754
Title: UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation
Title (reference translation): UniCanvas: A Diffusion-Based Unified Model for Text-in-Image Joint Generation
Authors: Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan
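Abstract summary: The paper proposes an attempt to unify diffusion models so that they generate interleaved multimodal content through text-in-image generation. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation.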
Reference score (independently computed attention score): 33.71491309079163
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.
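Abstract (reference translation): In recent years, unified vision-language models that handle both multimodal understanding and generation within a single architecture have made remarkable progress. Autoregressive VLMs can reason across modalities but fail to generate high-quality images. Diffusion models, in contrast, can produce photorealistic visuals but struggle to generate coherent text, which makes it difficult to develop a single unified model that handles both visual and text generation seamlessly. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. In this work, we therefore propose UniCanvas, a first attempt to unify diffusion models so that they generate interleaved multimodal content through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments show that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.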

Related papers