Fugu-MT Paper Translation (Abstract): Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Paper overview: Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

arxiv url: http://arxiv.org/abs/2412.08221v3
Date: Thu, 09 Oct 2025 23:10:23 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:44.895619
Title: Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Title (reference translation): Generating Any Scene: Scene-Graph-Driven Data Synthesis for Visual Generation Training
Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
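Abstract summary: We introduce Generate Any Scene, a data engine that enumerates scene graphs representing an array of visual scenes. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation. It also translates the graph into a set of visual question answers, enabling automatic evaluation and reward modeling of semantic alignment.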
Reference score (independently computed attention score): 61.75337990107149
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpasses fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
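Abstract (reference translation): Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, which limits models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene converts it into a caption for text-to-image or text-to-video generation. Using Generate Any Scene, we first design a self-improving framework in which models iteratively improve their performance using generated data. Stable Diffusion v1.5 improves on baselines by an average of 4% and surpasses fine-tuning on CC3M. Second, we design a distillation algorithm that transfers specific strengths from proprietary models to open-source models. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and raise the TIFA score on compositional and hard concept generation by 10%. Third, we create a reward model that aligns model generation with semantic accuracy at low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, training models to identify challenging cases by learning from synthetic data.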
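
The pipeline the abstract describes (sample a scene graph from a taxonomy, turn it into a caption, and derive question-answer pairs for checking semantic alignment) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy taxonomy, the names `SceneGraph`, `sample_scene_graph`, `to_caption`, `to_qa_pairs`, and `alignment_score`, the caption template, and the plain accuracy score are all assumptions made for this example.

```python
import random
from dataclasses import dataclass

# Hypothetical toy taxonomy; the taxonomy used in the paper is not shown here.
OBJECTS = ["dog", "ball", "bench"]
ATTRIBUTES = ["red", "small", "wooden"]
RELATIONS = ["next to", "on top of", "behind"]


@dataclass
class SceneGraph:
    objects: list      # object names
    attributes: dict   # object index -> attribute
    relations: list    # (subject index, relation, object index) triples


def sample_scene_graph(num_objects, num_relations, rng):
    """Sample a graph; complexity is set by the object and relation counts.

    num_objects must not exceed len(OBJECTS) in this toy version.
    """
    objects = rng.sample(OBJECTS, num_objects)
    attributes = {i: rng.choice(ATTRIBUTES) for i in range(num_objects)}
    pairs = [(i, j) for i in range(num_objects)
             for j in range(num_objects) if i != j]
    chosen = rng.sample(pairs, min(num_relations, len(pairs)))
    relations = [(s, rng.choice(RELATIONS), o) for s, o in chosen]
    return SceneGraph(objects, attributes, relations)


def phrase(graph, i):
    return f"a {graph.attributes[i]} {graph.objects[i]}"


def to_caption(graph):
    """Render the graph as a caption using a simple assumed template."""
    clauses = [f"{phrase(graph, s)} {rel} {phrase(graph, o)}"
               for s, rel, o in graph.relations]
    mentioned = {i for s, _, o in graph.relations for i in (s, o)}
    clauses += [phrase(graph, i) for i in range(len(graph.objects))
                if i not in mentioned]
    return ", and ".join(clauses)


def to_qa_pairs(graph):
    """Derive yes/no questions whose gold answers follow from the graph."""
    qa = []
    for i, name in enumerate(graph.objects):
        qa.append((f"Is there a {name}?", "yes"))
        qa.append((f"Is the {name} {graph.attributes[i]}?", "yes"))
    for s, rel, o in graph.relations:
        qa.append((f"Is the {graph.objects[s]} {rel} the {graph.objects[o]}?", "yes"))
    return qa


def alignment_score(qa_pairs, predicted_answers):
    """Fraction of questions answered consistently with the graph (assumed metric)."""
    correct = sum(pred == gold
                  for (_, gold), pred in zip(qa_pairs, predicted_answers))
    return correct / len(qa_pairs)


rng = random.Random(0)
graph = sample_scene_graph(num_objects=2, num_relations=1, rng=rng)
caption = to_caption(graph)      # prompt for a text-to-image or text-to-video model
qa_pairs = to_qa_pairs(graph)    # questions to pose to a VQA model about the output
```

In this sketch, `predicted_answers` would come from a visual question answering model run on the generated image or video; the resulting score could then serve as an evaluation signal or as a reward, which is the role the abstract assigns to the question-answer sets.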

Related papers