GenSpace: Benchmarking Spatially-Aware Image Generation
- URL: http://arxiv.org/abs/2505.24870v2
- Date: Fri, 06 Jun 2025 14:51:40 GMT
- Title: GenSpace: Benchmarking Spatially-Aware Image Generation
- Authors: Zehan Wang, Jiayang Xu, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao
- Abstract summary: Humans intuitively compose and arrange scenes in the 3D space for photography. Can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to assess the spatial awareness of current image generation models.
- Score: 76.98817635685278
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.
Related papers
- Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation [27.13700598039439]
This paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. We then employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images.
arXiv Detail & Related papers (2025-10-17T11:48:08Z)
- Constructing a 3D Scene from a Single Image [31.11317559252235]
SceneFuse-3D is a training-free framework designed to synthesize coherent 3D scenes from a single top-down view. We decompose the input image into overlapping regions and generate each using a pretrained 3D object generator. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning.
arXiv Detail & Related papers (2025-05-21T17:10:47Z)
- Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting [47.014044892025346]
Architect is a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting.
Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene.
arXiv Detail & Related papers (2024-11-14T22:15:48Z)
- 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new content with higher quality by exploiting the natural image prior of the 2D diffusion model and the global 3D information of the current scene.
Our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z)
- Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion [77.34078223594686]
We propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques.
Specifically, our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first, which is then transformed into a scene representation in a feed-forward manner.
Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
arXiv Detail & Related papers (2024-01-19T16:15:37Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- RoSI: Recovering 3D Shape Interiors from Few Articulation Images [20.430308190444737]
We present a learning framework to recover the shape interiors of existing 3D models with only their exteriors from multi-view and multi-articulation images.
Our neural architecture is trained in a category-agnostic manner and it consists of a motion-aware multi-view analysis phase.
In addition, our method also predicts part articulations and is able to realize and even extrapolate the captured motions on the target 3D object.
arXiv Detail & Related papers (2023-04-13T08:45:26Z) - Visual Localization using Imperfect 3D Models from the Internet [54.731309449883284]
This paper studies how imperfections in 3D models affect localization accuracy.
We show that 3D models from the Internet show promise as an easy-to-obtain scene representation.
arXiv Detail & Related papers (2023-04-12T16:15:05Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.