Fugu-MT Paper Translation (Abstract): Modality Forcing for Scalable Spatial Generation

Paper summary: Modality Forcing for Scalable Spatial Generation

arxiv url: http://arxiv.org/abs/2606.13676v1
Date: Thu, 11 Jun 2026 17:59:45 GMT
Status: Translation complete
System update date: 2026-06-12 15:55:27.984673
Title: Modality Forcing for Scalable Spatial Generation
Title (reference translation): Modality Forcing for Scalable Spatial Generation
Authors: Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park,
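Abstract summary: Text-to-image (T2I) models contain rich spatial priors. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data.
Reference score (independently computed attention score): 54.04539566839143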
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/
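The abstract states that separate noise levels per modality enable conditional and joint generation in any permutation. The sketch below illustrates that idea only; it is not the authors' implementation. The mode names, the helper functions `sample_noise_levels` and `add_noise`, the convention that 0 means clean and 1 means pure noise, and the linear interpolation schedule are all assumptions made for illustration.

```python
import random


def sample_noise_levels(mode, rng):
    """Return (t_image, t_depth) in [0, 1); 0 means clean, larger means noisier.

    Hypothetical helper: a clean modality (t = 0) acts as the condition,
    and a noisy modality is the one being generated.
    """
    if mode == "joint":            # generate image and depth together
        return rng.random(), rng.random()
    if mode == "image_to_depth":   # image is the condition, depth is generated
        return 0.0, rng.random()
    if mode == "depth_to_image":   # depth is the condition, image is generated
        return rng.random(), 0.0
    raise ValueError(f"unknown mode: {mode}")


def add_noise(values, t, rng):
    """Blend data with Gaussian noise by linear interpolation (assumed schedule)."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in values]


rng = random.Random(0)
image_latent = [0.5, -1.0, 2.0]   # toy stand-ins for image tokens
depth_latent = [1.5, 0.25, 3.0]   # toy stand-ins for depth tokens

t_img, t_depth = sample_noise_levels("image_to_depth", rng)
noisy_image = add_noise(image_latent, t_img, rng)    # unchanged because t_img is 0
noisy_depth = add_noise(depth_latent, t_depth, rng)  # perturbed at level t_depth
```

In this toy setup, a single denoiser that receives both token sets together with `(t_img, t_depth)` could cover depth prediction, depth-conditioned image generation, and joint generation simply by changing which noise level is set to zero.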
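Abstract (reference translation): Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders make it possible to train on sparse, real-world depth and achieve strong, generalizable depth prediction. By training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/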
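For readers unfamiliar with the AbsRel metric cited above, it is commonly defined as the mean of |prediction - ground truth| / ground truth over pixels with valid ground-truth depth. The sketch below is a minimal illustration of that standard definition. The convention that a ground-truth value of 0 marks a missing (sparse) measurement and the toy numbers are assumptions, not values from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth.

    Assumption: gt <= 0 marks a pixel with no depth measurement, as is
    common for sparse real-world depth; such pixels are skipped.
    """
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth")
    return sum(errors) / len(errors)


# Toy example: only the first and third pixels have ground-truth depth.
gt = [2.0, 0.0, 4.0, 0.0]
pred = [2.5, 9.0, 3.0, 1.0]
score = abs_rel(pred, gt)  # averages 0.5 / 2.0 and 1.0 / 4.0
```

Because the metric is only evaluated where ground truth exists, it applies directly to sparse depth maps.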

Related papers