Fugu-MT Paper Translation (Abstract): 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

Paper overview: 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

arxiv url: http://arxiv.org/abs/2311.17984v2
Date: Sun, 26 May 2024 10:15:13 GMT
Status: Translation complete
Last updated in system: 2024-05-29 08:25:17.031470
Title: 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Authors: Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, David B. Lindell
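Abstract summary: Current text-to-4D methods face a three-way tradeoff between the quality of scene appearance, 3D structure, and motion. This paper proposes hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models.
Reference score (attention metric computed independently by this site): 91.99172731031206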
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video models to generate dynamic 3D scenes. However, current text-to-4D methods face a three-way tradeoff between the quality of scene appearance, 3D structure, and motion. For example, text-to-image models and their 3D-aware variants are trained on internet-scale image datasets and can be used to produce scenes with realistic appearance and 3D structure -- but no motion. Text-to-video models are trained on relatively smaller video datasets and can produce scenes with motion, but poorer appearance and 3D structure. While these models have complementary strengths, they also have opposing weaknesses, making it difficult to combine them in a way that alleviates this three-way tradeoff. Here, we introduce hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates benefits of each for high-fidelity text-to-4D generation. Using hybrid SDS, we demonstrate synthesis of 4D scenes with compelling appearance, 3D structure, and motion.
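
The alternating-optimization idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three "guides" below are hypothetical stand-ins for pre-trained diffusion models, the three-element parameter vector is an invented simplification of a 4D scene, and the names `make_toy_guide` and `alternating_optimize` are assumptions made for illustration only. The sketch shows only the general pattern, in which each step takes its update from one supervision source and the sources are cycled so that each contributes what it is good at.

```python
from itertools import cycle


def make_toy_guide(target, mask):
    """Return a toy guide (hypothetical stand-in for a diffusion model).

    The guide proposes an update direction that pulls only the masked
    coordinates of theta toward the target values.
    """
    def direction(theta):
        return [m * (t - x) for x, t, m in zip(theta, target, mask)]
    return direction


def alternating_optimize(theta, guides, steps=300, lr=0.1):
    """Apply one guide's update per step, cycling through the guides in order."""
    theta = list(theta)
    for _, guide in zip(range(steps), cycle(guides)):
        d = guide(theta)
        theta = [x + lr * di for x, di in zip(theta, d)]
    return theta


# Toy scene parameters: [appearance, structure, motion] (illustrative only).
target = [1.0, 2.0, 3.0]
guides = [
    make_toy_guide(target, [1, 0, 0]),  # image-like guide: appearance only
    make_toy_guide(target, [1, 1, 0]),  # 3D-aware-like guide: appearance and structure
    make_toy_guide(target, [0, 0, 1]),  # video-like guide: motion only
]

result = alternating_optimize([0.0, 0.0, 0.0], guides)
image_only = alternating_optimize([0.0, 0.0, 0.0], guides[:1])
```

In this toy setting no single guide can move all three coordinates, but cycling through them does. With only the image-like guide, the motion coordinate never leaves its initial value.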


Related papers