Fugu-MT Paper Translation (Abstract): Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

Paper abstract: Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

arxiv url: http://arxiv.org/abs/2508.20470v1
Date: Thu, 28 Aug 2025 06:39:41 GMT
Status: Translation complete
System update date: 2025-08-29 18:12:02.088951
Title: Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Title (reference translation): Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Authors: Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
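Abstract summary: This paper explores how to apply the video modality to 3D asset generation, spanning datasets to models. It introduces Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and trains Droplet3D, a generative model supporting both image and dense text input.
Reference score (independently computed attention score): 44.64235988574981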
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
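Abstract (reference translation): Scaling laws have validated the success and promise of models trained on large-scale data for creative generation across the text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain. Fortunately, there are adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained in the videos makes the generated content more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality to 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. We validate the effectiveness of the approach and demonstrate that it can produce spatially consistent and semantically plausible content. Moreover, in contrast to prevailing 3D solutions, the approach shows potential for extension to scene-level applications. This indicates that the commonsense priors from videos significantly facilitate 3D creation. We have open-sourced all resources, including the dataset, code, technical framework, and model weights.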

Related papers