Related papers: RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

URL: http://arxiv.org/abs/2602.19974v1
Date: Mon, 23 Feb 2026 15:39:53 GMT
Title: RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection
Authors: Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou
Abstract summary: RL-RIG is a Reinforcement Learning framework for Reflection-based Image Generation. We develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
Score: 18.52946282633359
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on the LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
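
Illustrative note: the abstract names Scene Graph IoU as an evaluation metric but does not define it here. The sketch below shows one common reading, assuming a scene graph is a set of (subject, relation, object) triples and that IoU is the size of the intersection divided by the size of the union of two such sets. The function names, the lowercase normalization, and the convention that two empty graphs score 1.0 are assumptions made for illustration, not the paper's implementation.

```python
from typing import Iterable, Set, Tuple

# A scene-graph edge as a (subject, relation, object) triple (assumed representation).
Triple = Tuple[str, str, str]


def normalize(triples: Iterable[Triple]) -> Set[Triple]:
    """Lowercase and strip each field so trivial formatting differences do not count as mismatches."""
    return {(s.strip().lower(), r.strip().lower(), o.strip().lower()) for s, r, o in triples}


def scene_graph_iou(predicted: Iterable[Triple], reference: Iterable[Triple]) -> float:
    """Intersection-over-union of two triple sets (hypothetical helper, not the paper's code)."""
    pred = normalize(predicted)
    ref = normalize(reference)
    union = pred | ref
    if not union:
        # Assumed convention: two empty graphs are treated as a perfect match.
        return 1.0
    return len(pred & ref) / len(union)


if __name__ == "__main__":
    reference = [("cat", "on", "sofa"), ("lamp", "left of", "sofa")]
    predicted = [("Cat", "on", "sofa"), ("lamp", "right of", "sofa")]
    # One triple matches; the union holds three distinct triples.
    print(round(scene_graph_iou(predicted, reference), 3))
```

In this toy case the generated image places the lamp on the wrong side, so only one of three distinct triples is shared and the score drops to about 0.333. A metric of this shape penalizes wrong spatial relations even when every object is present.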

Related papers

Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning [56.71089466532673]
We propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks. Our results show consistent improvements across various evaluation metrics.
arXiv Detail & Related papers (2025-08-09T10:37:26Z)
HRR: Hierarchical Retrospection Refinement for Generated Image Detection [16.958383381415445]
We propose a diffusion model-based generative image detection framework termed Hierarchical Retrospection Refinement (HRR). The HRR framework consistently delivers significant performance improvements, outperforming state-of-the-art methods in the generated image detection task.
arXiv Detail & Related papers (2025-02-25T05:13:44Z)
Autoregressive Image Generation with Vision Full-view Prompt [18.569610688433745]
Inspired by prompt engineering from the field of NLP, we propose Vision Full-view prompt (VF prompt) to enhance autoregressive image generation.
arXiv Detail & Related papers (2025-02-24T08:44:01Z)
RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning [54.07026389388881]
We present the first real-object-based retrieval-augmented generation framework (RealRAG). RealRAG augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Our framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation.
arXiv Detail & Related papers (2025-02-02T16:41:54Z)
Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR) in the form of next-scale prediction. We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
Timestep-Aware Diffusion Model for Extreme Image Rescaling [47.89362819768323]
We propose a novel framework called Timestep-Aware Diffusion Model (TADM) for extreme image rescaling. TADM performs rescaling operations in the latent space of a pre-trained autoencoder. It effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2024-08-17T09:51:42Z)
Semantic Guided Large Scale Factor Remote Sensing Image Super-resolution with Generative Diffusion Prior [13.148815217684277]
Large scale factor super-resolution (SR) algorithms are vital for maximizing the utilization of low-resolution (LR) satellite data captured from orbit. Existing methods confront challenges in recovering SR images with clear textures and correct ground objects. We introduce a novel framework, the Semantic Guided Diffusion Model (SGDM), designed for large scale factor remote sensing image super-resolution.
arXiv Detail & Related papers (2024-05-11T16:06:16Z)
In-Domain GAN Inversion for Faithful Reconstruction and Editability [132.68255553099834]
We propose in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. We make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property.
arXiv Detail & Related papers (2023-09-25T08:42:06Z)
A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called generative-model inference that is capable of enhancing pre-trained GANs effectively and seamlessly. Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
arXiv Detail & Related papers (2021-12-07T05:22:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.