Fugu-MT Paper Translation (Abstract): Text-Image Conditioned 3D Generation

Paper summary: Text-Image Conditioned 3D Generation

arxiv url: http://arxiv.org/abs/2603.21295v1
Date: Sun, 22 Mar 2026 15:36:16 GMT
Status: Translation complete
System update date: 2026-03-24 19:11:39.337252
Title: Text-Image Conditioned 3D Generation
Title (reference translation): Text-Image Conditioned 3D Generation
Authors: Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang, Taoran Yi, Zanwei Zhou, Zhikuan Bao, Lingxi Xie, Wei Shen, Qi Tian
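Abstract summary: TIGON is a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. The diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models. Extensive experiments show that text-image conditioning consistently improves over single-modality methods.
Reference score (independently computed attention score): 71.98375600100856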
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
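Abstract (reference translation): High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, and interest is growing in generative models that create 3D content from user prompts. However, many existing 3D generators rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance but lack low-level visual detail. Can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple fusion of text- and image-conditioned predictions outperforms single-modality models, indicating strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page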
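The abstract does not say how the "simple late fusion" in the diagnostic study is computed. As a rough illustration only, the sketch below assumes late fusion means a convex combination of the outputs that an image-conditioned model and a text-conditioned model produce for the same input. The function name `late_fuse`, the `image_weight` parameter, and the toy vectors are hypothetical and are not taken from the paper or from TIGON's code.

```python
from typing import List, Sequence


def late_fuse(
    image_pred: Sequence[float],
    text_pred: Sequence[float],
    image_weight: float = 0.5,
) -> List[float]:
    """Blend two same-shaped predictions with a convex combination.

    Assumption: each branch has already produced its own prediction,
    and fusion happens only on those outputs (hence "late").
    """
    if len(image_pred) != len(text_pred):
        raise ValueError("predictions must have the same length")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be in [0, 1]")
    text_weight = 1.0 - image_weight
    return [
        image_weight * a + text_weight * b
        for a, b in zip(image_pred, text_pred)
    ]


# Toy stand-ins for the outputs of the two single-modality models.
image_pred = [1.0, 0.0, 2.0]
text_pred = [0.0, 1.0, 2.0]

fused = late_fuse(image_pred, text_pred, image_weight=0.75)
print(fused)
```

With `image_weight=1.0` or `0.0` the function falls back to the image-only or text-only prediction, so the single-modality baselines are the two endpoints of the same blend.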


Related papers