Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
- URL: http://arxiv.org/abs/2412.08221v3
- Date: Thu, 09 Oct 2025 23:10:23 GMT
- Title: Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
- Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna,
- Abstract summary: We introduce Generate Any Scene, a data engine that enumerates scene graphs representing an array of possible visual scenes.<n>Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation.<n>It also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment.
- Score: 61.75337990107149
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpassing fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
Related papers
- ViStoryBench: Comprehensive Benchmark Suite for Story Visualization [23.274981415638837]
ViStoryBench is an evaluation benchmark for story visualization models.<n>It features stories with single and multiple protagonists to test models' ability to maintain character consistency.<n>It includes complex plots and intricate world-building to challenge models in generating accurate visuals.
arXiv Detail & Related papers (2025-05-30T17:58:21Z) - The Scene Language: Representing Scenes with Programs, Words, and Embeddings [23.707974056165042]
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
arXiv Detail & Related papers (2024-10-22T07:40:20Z) - Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z) - GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph
Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene
Graph Generation [46.14140601855313]
We propose a novel model called SGFormer, Semantic Graph TransFormer for point cloud-based 3D scene graph generation.
The task aims to parse a point cloud-based scene into a semantic structural graph, with the core challenge of modeling the complex global structure.
arXiv Detail & Related papers (2023-03-20T11:59:23Z) - Iterative Scene Graph Generation with Generative Transformers [6.243995448840211]
Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format.
Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene.
This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction.
arXiv Detail & Related papers (2022-11-30T00:05:44Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video)
Existing approaches to scene graph generation assume certain factorization of the joint distribution to make the estimation iteration feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z) - Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using
Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfy underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges)
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z) - Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training.
We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples.
For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z) - Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attentions due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.