7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
- URL: http://arxiv.org/abs/2508.12919v1
- Date: Mon, 18 Aug 2025 13:37:51 GMT
- Title: 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
- Authors: Elena Izzo, Luca Parolari, Davide Vezzaro, Lamberto Ballan
- Abstract summary: We introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy.
- Score: 3.8123588214292745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Layout-guided text-to-image models offer greater control over the generation process by explicitly conditioning image synthesis on the spatial arrangement of elements. As a result, their adoption has increased in many computer vision applications, ranging from content creation to synthetic data generation. A critical challenge is achieving precise alignment between the image, textual prompt, and layout, ensuring semantic fidelity and spatial accuracy. Although recent benchmarks assess text alignment, layout alignment remains overlooked, and no existing benchmark jointly evaluates both. This gap limits the ability to evaluate a model's spatial fidelity, which is crucial when using layout-guided generation for synthetic data, as errors can introduce noise and degrade data quality. In this work, we introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. It features text-and-layout pairs spanning seven challenging scenarios, investigating object generation, color fidelity, attribute recognition, inter-object relationships, and spatial control. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy. Using 7Bench, we evaluate several state-of-the-art diffusion models, uncovering their respective strengths and limitations across diverse alignment tasks. The benchmark is available at https://github.com/Elizzo/7Bench.
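The abstract describes scoring spatial accuracy via a layout alignment score. As a minimal sketch of what such a score could look like (the paper's exact formulation may differ; the function names and the greedy best-match strategy below are assumptions, not 7Bench's protocol), spatial accuracy can be measured as the mean best-match IoU between the target layout boxes and same-label boxes detected in the generated image:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def layout_alignment_score(target_boxes, detected_boxes):
    """Mean best-match IoU: for each (label, box) in the target layout,
    take the highest IoU among detected boxes carrying the same label.
    A missing object contributes 0, so the score penalizes both
    misplaced and absent instances."""
    if not target_boxes:
        return 1.0
    scores = []
    for label, tbox in target_boxes:
        best = max((box_iou(tbox, dbox)
                    for dlabel, dbox in detected_boxes if dlabel == label),
                   default=0.0)
        scores.append(best)
    return sum(scores) / len(scores)
```

A score of 1.0 means every requested object was detected exactly where the layout placed it; in practice the detected boxes would come from an off-the-shelf open-vocabulary detector run on the generated image.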
Related papers
- HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment [84.65251073657883]
We propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Third, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters.
arXiv Detail & Related papers (2026-01-08T05:41:06Z) - RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis [89.26382925677301]
Virtual furniture synthesis holds substantial promise for home design and e-commerce applications. RoomEditor++ is a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone. RoomEditor++ is superior to state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies.
arXiv Detail & Related papers (2025-12-19T13:39:43Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps [43.782757481408076]
We identify two primary challenges: large overlapping regions and overlapping instances with minimal semantic distinction. We introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. We present Creati-AM, a benchmark featuring high-quality annotations and a balanced distribution across different levels of OverLayScore.
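The summary does not give OverLayScore's definition. As a simple illustrative stand-in (the function name and formula below are assumptions, not the paper's metric), the overlap complexity of a layout can be sketched as the sum of pairwise IoUs between its bounding boxes, which is zero for disjoint layouts and grows with overlap density:

```python
from itertools import combinations

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def overlap_complexity(boxes):
    """Sum of pairwise IoUs across all box pairs in a layout:
    0.0 when no boxes overlap, larger for denser overlaps."""
    return sum(box_iou(a, b) for a, b in combinations(boxes, 2))
```

A metric of this general shape lets a benchmark bucket layouts into difficulty levels, which matches the summary's description of a balanced distribution across levels of the score.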
arXiv Detail & Related papers (2025-09-23T17:50:00Z) - CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance [47.59187786346473]
We present CountLoop, a training-free framework that provides diffusion models with accurate instance control. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98%.
arXiv Detail & Related papers (2025-08-18T11:28:02Z) - RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation [28.029569617900894]
RefVNLI is a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. It is trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations.
arXiv Detail & Related papers (2025-04-24T12:44:51Z) - What Makes a Scene? Scene Graph-based Evaluation and Feedback for Controllable Generation [29.42202665594218]
We introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, and SGScore, a novel evaluation metric. We develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image.
arXiv Detail & Related papers (2024-11-23T03:40:25Z) - LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z) - Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation [147.81509219686419]
We propose a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting.
We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order.
arXiv Detail & Related papers (2023-04-13T16:58:33Z) - SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z) - Towards Better Text-Image Consistency in Text-to-Image Generation [15.735515302139335]
We develop a novel CLIP-based metric termed Semantic Similarity Distance (SSD).
We further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which can fuse semantic information at different granularities.
Our PDF-GAN can lead to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
arXiv Detail & Related papers (2022-10-27T07:47:47Z) - Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by a bounding-box object layout, which lacks shape information, while the pose of each person is specified by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared compositional structural space.
This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self-supervised manner.
arXiv Detail & Related papers (2020-08-28T14:33:28Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the low and high frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.