What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation
- URL: http://arxiv.org/abs/2411.15435v1
- Date: Sat, 23 Nov 2024 03:40:25 GMT
- Title: What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation
- Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
- Abstract summary: We introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes.
Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, and SGScore, a novel evaluation metric.
We develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image.
- Score: 29.42202665594218
- Abstract: While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
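The abstract names two concrete mechanisms: SGScore, which asks a multimodal LLM about each object and relation in the scene graph, and a feedback loop that regenerates the image until discrepancies are resolved. The Python sketch below illustrates that evaluate-and-refine pattern under stated assumptions: `query_mllm`, `Triple`, and `generator` are hypothetical stand-ins, and the paper's actual prompts, scoring formula, and correction mechanism may differ.

```python
# Minimal sketch of an SGScore-style check and feedback loop.
# All helper names here are hypothetical, not the authors' implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

def query_mllm(image, question: str) -> bool:
    """Placeholder for a chain-of-thought yes/no query to a multimodal LLM."""
    raise NotImplementedError("wire up a real MLLM client here")

def sg_score(image, graph: list[Triple]) -> float:
    """Average object presence and relationship accuracy over the graph."""
    objects = {t.subject for t in graph} | {t.obj for t in graph}
    presence = sum(
        query_mllm(image, f"Is there a {o} in the image?") for o in objects
    ) / len(objects)
    accuracy = sum(
        query_mllm(image, f"Is the {t.subject} {t.predicate} the {t.obj}?")
        for t in graph
    ) / len(graph)
    return (presence + accuracy) / 2

def refine(image, graph: list[Triple], generator, max_rounds: int = 3):
    """Regenerate until every triple is satisfied or the round budget runs out."""
    for _ in range(max_rounds):
        missing = [
            t for t in graph
            if not query_mllm(image, f"Is the {t.subject} {t.predicate} the {t.obj}?")
        ]
        if not missing:
            break
        image = generator(image, missing)  # e.g., an editing model given the unmet triples
    return image
```

Splitting the score into a presence term and a relation term mirrors the summary's claim that SGScore assesses "both object presence and relationship accuracy"; the 50/50 weighting is an assumption.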
Related papers
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific.
In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance [46.77060502803466]
We introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings.
The SG-Adapter's explicit, non-fully connected graph representation improves markedly on fully connected, transformer-based text representations; a toy sketch of graph-masked attention follows this entry.
arXiv Detail & Related papers (2024-05-24T08:00:46Z)
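One way to read the SG-Adapter's "explicit and non-fully connected" representation is as self-attention restricted by a scene-graph adjacency mask, so a text token attends only to tokens the graph links it to. The numpy sketch below illustrates that masking idea only; it is not the paper's architecture, and the tied-QKV simplification and adjacency construction are assumptions.

```python
import numpy as np

def graph_masked_attention(tokens: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Tied-QKV self-attention in which token i attends to token j only when
    adj[i, j] is nonzero; keep self-loops on the diagonal so rows stay valid."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = np.where(adj.astype(bool), scores, -np.inf)  # sever non-edges
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Example: three tokens, edges 0-1 and 1-2, plus self-loops.
tokens = np.random.default_rng(0).normal(size=(3, 8))
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
out = graph_masked_attention(tokens, adj)
```

In contrast with a standard transformer, where every token attends to every other token, unlinked tokens here contribute zero weight.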
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Controllable Image Generation via Collage Representations [31.456445433105415]
"Mixing and matching scenes" (M&Ms) is an approach that consists of an adversarially trained generative image model conditioned on appearance features and spatial positions of the different elements in a collage.
We show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity.
arXiv Detail & Related papers (2023-04-26T17:58:39Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume a certain factorization of the joint distribution to make estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Exploiting Relationship for Complex-scene Image Generation [43.022978211274065]
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph.
We propose three major updates to the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects.
Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships.
Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph.
arXiv Detail & Related papers (2021-04-01T09:21:39Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images; a sketch of the underlying distance follows this entry.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
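The SceneFID entry above reduces, per its summary, to computing the Fréchet Inception Distance over object crops rather than whole images. Below is a minimal numpy/scipy sketch of the Fréchet distance itself; the crop extraction and Inception feature pooling (`crops_from` and `inception_features` in the usage comment) are hypothetical upstream steps, not the paper's exact pipeline.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N x D) feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# SceneFID-style usage (hypothetical helpers): pool features of object crops,
# one row per ground-truth or generated box, then compare the distributions.
# feats_real = inception_features(crops_from(real_images, real_boxes))
# feats_fake = inception_features(crops_from(fake_images, fake_boxes))
# score = frechet_distance(feats_real, feats_fake)
```

Cropping before feature extraction is what makes the metric object-centric: a model can no longer score well on global statistics while garbling individual objects.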