Fugu-MT Paper Translation (Abstract): CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Paper summary: CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

arxiv url: http://arxiv.org/abs/2603.08652v1
Date: Mon, 09 Mar 2026 17:31:16 GMT
Status: translation complete
System update date: 2026-03-10 15:13:16.610052
Title: CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu,
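Abstract summary: We propose CoCo (Code-as-CoT), a code-driven reasoning framework. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result.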
Reference score (independently computed attention score): 17.789454097040366
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
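The three stages described in the abstract (generate layout code, execute it to render a deterministic draft, then refine the draft) can be illustrated with a toy sketch. This is not the CoCo implementation: the `Canvas` class, `run_layout_code`, `refine`, and the text-grid "image" are all hypothetical stand-ins, the layout code is hard-coded rather than produced by a model, and stripping `__builtins__` from `exec` is only a gesture at isolation, not a real sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class Canvas:
    """Hypothetical character-grid stand-in for a rendered draft image."""
    width: int
    height: int
    cells: list = field(init=False)

    def __post_init__(self):
        self.cells = [[" "] * self.width for _ in range(self.height)]

    def rect(self, x, y, w, h, ch="#"):
        # Fill a rectangle, clipped to the canvas bounds.
        for row in range(max(y, 0), min(y + h, self.height)):
            for col in range(max(x, 0), min(x + w, self.width)):
                self.cells[row][col] = ch

    def text(self, x, y, s):
        # Write a string on one row, clipped to the canvas bounds.
        if 0 <= y < self.height:
            for i, ch in enumerate(s):
                if 0 <= x + i < self.width:
                    self.cells[y][x + i] = ch

    def render(self):
        return "\n".join("".join(row) for row in self.cells)


def run_layout_code(code, width=24, height=6):
    """Execute layout code against a fresh canvas and return the draft.

    Assumption: removing builtins only limits what the code can name;
    it is NOT a secure sandbox and must not be used on untrusted code.
    """
    canvas = Canvas(width, height)
    namespace = {"__builtins__": {}, "canvas": canvas}
    exec(code, namespace)
    return canvas.render()


def refine(draft):
    """Placeholder for the image-editing stage; returns the draft unchanged."""
    return draft


# Stage 1 (assumed): in CoCo a model would emit this code from the prompt.
LAYOUT_CODE = """
canvas.rect(0, 0, 24, 1, "=")
canvas.text(2, 2, "SALE 50% OFF")
canvas.rect(18, 3, 4, 2, "#")
"""

draft = run_layout_code(LAYOUT_CODE)  # Stage 2: deterministic draft
final = refine(draft)                 # Stage 3: refinement (no-op here)
print(final)
```

Because the draft comes from executing code rather than from sampling, running the same layout code twice yields the same draft, which is the property the abstract calls "deterministic".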

Related papers