Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
- URL: http://arxiv.org/abs/2412.08221v2
- Date: Mon, 16 Dec 2024 09:54:46 GMT
- Title: Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
- Authors: Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
- Abstract summary: We introduce Generate Any Scene, a framework that enumerates scene graphs representing a vast array of visual scenes.
Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models.
We conduct extensive evaluations across text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance.
- Score: 44.32980579195508
- Abstract: DALL-E and Sora have gained attention by producing implausible images, such as "astronauts riding a horse in space." Despite the proliferation of text-to-vision models that have inundated the internet with synthetic visuals, from images to 3D assets, current benchmarks predominantly evaluate these models on real-world scenes paired with captions. We introduce Generate Any Scene, a framework that systematically enumerates scene graphs representing a vast array of visual scenes, spanning realistic to imaginative compositions. Generate Any Scene leverages 'scene graph programming', a method for dynamically constructing scene graphs of varying complexity from a structured taxonomy of visual elements. This taxonomy includes numerous objects, attributes, and relations, enabling the synthesis of an almost infinite variety of scene graphs. Using these structured representations, Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models through standard metrics. We conduct extensive evaluations across multiple text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance. We find that DiT-backbone text-to-image models align more closely with input captions than UNet-backbone models. Text-to-video models struggle with balancing dynamics and consistency, while both text-to-video and text-to-3D models show notable gaps in human preference alignment. We demonstrate the effectiveness of Generate Any Scene by conducting three practical applications leveraging captions generated by Generate Any Scene: 1) a self-improving framework where models iteratively enhance their performance using generated data, 2) a distillation process to transfer specific strengths from proprietary models to open-source counterparts, and 3) improvements in content moderation by identifying and generating challenging synthetic data.
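The abstract's core idea, dynamically assembling scene graphs from a taxonomy of objects, attributes, and relations and then linearizing each graph into a caption, can be illustrated with a minimal sketch. This is not the paper's implementation: the tiny `TAXONOMY`, the helper names, and the template-based captioner are all hypothetical stand-ins, assuming a simple chain-structured graph for brevity.

```python
import random

# Hypothetical miniature taxonomy; the paper's taxonomy is far larger.
TAXONOMY = {
    "objects": ["astronaut", "horse", "castle", "robot"],
    "attributes": ["glowing", "tiny", "ancient", "metallic"],
    "relations": ["riding", "next to", "above", "holding"],
}

def sample_scene_graph(num_objects=2, seed=None):
    """Randomly assemble a scene graph: attributed object nodes
    connected by sampled relation edges (chain-structured here)."""
    rng = random.Random(seed)
    nodes = [
        {
            "object": rng.choice(TAXONOMY["objects"]),
            "attribute": rng.choice(TAXONOMY["attributes"]),
        }
        for _ in range(num_objects)
    ]
    edges = [
        (i, rng.choice(TAXONOMY["relations"]), i + 1)
        for i in range(num_objects - 1)
    ]
    return {"nodes": nodes, "edges": edges}

def graph_to_caption(graph):
    """Linearize the graph into a caption via a simple template."""
    phrases = [f"a {n['attribute']} {n['object']}" for n in graph["nodes"]]
    parts = [phrases[0]]
    for _, relation, dst in graph["edges"]:
        parts.append(f"{relation} {phrases[dst]}")
    return " ".join(parts)

graph = sample_scene_graph(num_objects=2, seed=0)
print(graph_to_caption(graph))  # e.g. "a tiny robot riding a glowing horse"
```

Because sampling is combinatorial over the taxonomy, even this toy version yields many distinct captions; scaling the taxonomy and graph complexity is what lets the framework enumerate a near-infinite variety of scenes.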
Related papers
- The Scene Language: Representing Scenes with Programs, Words, and Embeddings [23.707974056165042]
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes.
It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity.
arXiv Detail & Related papers (2024-10-22T07:40:20Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform standard text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.