Fugu-MT Paper Translation (Abstract): Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

Paper summary: Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming

arxiv url: http://arxiv.org/abs/2412.08221v2
Date: Mon, 16 Dec 2024 09:54:46 GMT
Status: Translation complete
Last updated in system: 2024-12-17 15:49:59.468294
Title: Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
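Title (reference translation): Generating Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
Abstract summary: We introduce Generate Any Scene, a framework that enumerates scene graphs. Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models. We conduct extensive evaluations of text-to-image, text-to-video, and text-to-3D models and present key findings on model performance.
Reference score (attention score computed independently by this site): 44.32980579195508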
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages 'scene graph programming', a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle with balancing dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene by conducting three practical applications leveraging captions generated by Generate Any Scene: 1) a self-improving framework where models iteratively enhance their performance using generated data, 2) a distillation process to transfer specific strengths from proprietary models to open-source counterparts, and 3) improvements in content moderation by identifying and generating challenging synthetic data.
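Abstract (reference translation): DALL-E and Sora have attracted attention by producing implausible images such as "astronauts riding a horse in space." Although text-to-vision models have proliferated and flooded the internet with synthetic visuals, from images to 3D assets, current benchmarks mainly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a wide variety of visual scenes. Generate Any Scene uses "scene graph programming," which dynamically constructs scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy contains many objects, attributes, and relations, enabling the synthesis of an almost unlimited variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models with standard metrics. We conduct extensive evaluations of multiple text-to-image, text-to-video, and text-to-3D models and present key findings on model performance. DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle to balance dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in alignment with human preferences. We demonstrate the effectiveness of Generate Any Scene through three practical applications that use its generated captions: 1) a self-improving framework in which models iteratively enhance their performance using generated data, 2) a distillation process that transfers specific strengths from proprietary models to open-source models, and 3) improvements in content moderation by identifying and generating challenging synthetic data.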
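
The abstract describes scene graph programming only at a high level: scene graphs of varying complexity are built from a taxonomy of objects, attributes, and relations, and each graph is then translated into a caption. The following is a minimal Python sketch of that idea, not the authors' implementation. The tiny taxonomy lists, the `SceneGraph` class, and the functions `sample_scene_graph` and `to_caption` are all hypothetical names introduced here for illustration, and the caption template is an assumption.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical toy taxonomy; the taxonomy described in the abstract is far larger.
OBJECTS = ["astronaut", "horse", "table", "lamp"]
ATTRIBUTES = ["red", "wooden", "tiny", "glowing"]
RELATIONS = ["riding", "next to", "under", "holding"]


@dataclass
class SceneGraph:
    objects: List[str]
    # Maps an object index to the attributes attached to that object.
    attributes: Dict[int, List[str]] = field(default_factory=dict)
    # Each relation is (subject index, relation name, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


def sample_scene_graph(num_objects: int, num_attributes: int,
                       num_relations: int, rng: random.Random) -> SceneGraph:
    """Sample a graph whose complexity is set by the three counts."""
    if num_relations > 0 and num_objects < 2:
        raise ValueError("a relation needs at least two objects")
    objects = [rng.choice(OBJECTS) for _ in range(num_objects)]
    attributes: Dict[int, List[str]] = {}
    for _ in range(num_attributes):
        idx = rng.randrange(num_objects)
        attr = rng.choice(ATTRIBUTES)
        # Skip repeats on the same object, so num_attributes is an upper bound.
        if attr not in attributes.setdefault(idx, []):
            attributes[idx].append(attr)
    relations = []
    for _ in range(num_relations):
        src, dst = rng.sample(range(num_objects), 2)  # two distinct objects
        relations.append((src, rng.choice(RELATIONS), dst))
    return SceneGraph(objects, attributes, relations)


def describe(graph: SceneGraph, idx: int) -> str:
    """Render one object together with its attributes, e.g. 'red horse'."""
    attrs = " ".join(graph.attributes.get(idx, []))
    return f"{attrs} {graph.objects[idx]}".strip()


def to_caption(graph: SceneGraph) -> str:
    """Translate a scene graph into a caption using a fixed, assumed template."""
    parts = [f"{describe(graph, s)} {rel} {describe(graph, d)}"
             for s, rel, d in graph.relations]
    related = {i for s, _, d in graph.relations for i in (s, d)}
    # Objects that take part in no relation are listed on their own.
    parts += [describe(graph, i) for i in range(len(graph.objects))
              if i not in related]
    return "; ".join(parts)


# Hand-built graph for the example quoted in the abstract.
example = SceneGraph(objects=["astronaut", "horse"],
                     attributes={1: ["red"]},
                     relations=[(0, "riding", 1)])
print(to_caption(example))  # -> astronaut riding red horse

# Randomly sampled graph; its caption depends on the seed.
print(to_caption(sample_scene_graph(3, 2, 2, random.Random(0))))
```

Varying the three counts passed to `sample_scene_graph` changes the complexity of the sampled graph, which loosely mirrors the "varying complexity" mentioned in the abstract. The resulting captions could then be passed to a text-to-vision model and scored with whichever metric is being used.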
