Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
- URL: http://arxiv.org/abs/2601.05124v1
- Date: Thu, 08 Jan 2026 17:13:00 GMT
- Title: Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
- Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai,
- Abstract summary: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts. Re-Align bridges the gap between understanding and generation through structured reasoning-guided alignment.
- Score: 38.240269144736224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competing methods of comparable model scale and resources on both in-context image generation and editing tasks.
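The abstract describes the surrogate reward only at a high level. Below is a minimal sketch that assumes a CLIP-style cosine similarity between the IC-CoT reasoning text and the generated image, plugged into a REINFORCE-style update; the encoders, feature sizes, and update rule are illustrative stand-ins, not Re-Align's actual implementation.

```python
import torch
import torch.nn.functional as F

# Stand-in encoders: Re-Align's actual reward model is not specified at this
# level of detail, so random linear projections keep the sketch self-contained.
text_encoder = torch.nn.Linear(512, 256)    # embeds the IC-CoT reasoning text
image_encoder = torch.nn.Linear(1024, 256)  # embeds the generated image

def surrogate_reward(text_feats, image_feats):
    """Cosine similarity between reasoning text and generated image,
    used as a scalar per-sample reward during RL."""
    t = F.normalize(text_encoder(text_feats), dim=-1)
    v = F.normalize(image_encoder(image_feats), dim=-1)
    return (t * v).sum(dim=-1)

# Toy REINFORCE-style step: raise the likelihood of high-reward generations.
text_feats, image_feats = torch.randn(4, 512), torch.randn(4, 1024)
log_probs = torch.randn(4, requires_grad=True)  # log p(image | prompt), per sample
rewards = surrogate_reward(text_feats, image_feats).detach()
loss = -((rewards - rewards.mean()) * log_probs).mean()
loss.backward()
```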
Related papers
- Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation [22.845591588026366]
We propose a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features (a toy version is sketched after this entry). At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module.
arXiv Detail & Related papers (2026-02-03T12:13:29Z)
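A minimal sketch of the VAE dropout idea described above: reference VAE features are randomly omitted during training so the model cannot rely on them alone. The drop rate, tensor layout, and per-reference granularity are assumptions.

```python
import torch

def vae_dropout(ref_feats: torch.Tensor, p_drop: float = 0.3, training: bool = True) -> torch.Tensor:
    """Randomly omit whole reference VAE feature maps during training, so the
    model must rely on high-level concept guidance instead of copying pixels.
    `p_drop` and the tensor layout are assumptions, not the paper's values."""
    if not training or p_drop <= 0.0:
        return ref_feats
    # One keep/drop decision per reference image (batch, num_refs), so a
    # dropped reference contributes no appearance signal at all.
    keep = (torch.rand(ref_feats.shape[:2], device=ref_feats.device) > p_drop).float()
    return ref_feats * keep[:, :, None, None, None]

refs = torch.randn(2, 3, 4, 32, 32)  # 2 samples x 3 reference latents (C=4, 32x32)
out = vae_dropout(refs)
```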
- GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features. A residual learning module is proposed to recover fine-grained details and minimize the reconstruction error caused by quantization (see the sketch below). Experiments on the standard ImageNet-1k benchmark show that the proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z)
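A minimal sketch of a residual learning module as described: a small network predicts the detail lost to quantization and adds it back. The layer sizes and the rounding stand-in for codebook quantization are illustrative, not GloTok's configuration.

```python
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    """Predict the fine-grained detail lost by quantization and add it back
    to the quantized features. Sizes are illustrative."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, quantized: torch.Tensor) -> torch.Tensor:
        return quantized + self.refine(quantized)  # coarse tokens + learned residual

feats = torch.randn(1, 256, 16, 16)          # continuous encoder features
quantized = feats.round()                     # stand-in for codebook quantization
refined = ResidualRefiner()(quantized)
recon_loss = (refined - feats).pow(2).mean()  # train the refiner to close the gap
```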
- Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing [16.943575863059607]
Image-POSER orchestrates a diverse registry of pretrained text-to-image and image-to-image experts. It handles long-form prompts end-to-end through dynamic task decomposition, and is consistently preferred in human evaluations (a toy orchestration loop follows this entry).
arXiv Detail & Related papers (2025-11-15T03:15:34Z)
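A toy sketch of reflective multi-expert orchestration under stated assumptions: a plan of subtasks is dispatched to a registry of experts, and each result is kept only if a critic scores it as an improvement. The registry, critic, and acceptance rule are hypothetical stand-ins for Image-POSER's learned components.

```python
from typing import Callable, Dict

# Hypothetical expert registry; names and signatures are illustrative only.
Experts = Dict[str, Callable[[str, str], str]]  # (instruction, image) -> image

def run_pipeline(subtasks, experts: Experts, critic, image: str = "<blank>") -> str:
    """Reflective orchestration: run each subtask with its chosen expert and
    keep the result only if a critic judges it an improvement."""
    for instruction, expert_name in subtasks:
        candidate = experts[expert_name](instruction, image)
        if critic(instruction, candidate) >= critic(instruction, image):
            image = candidate  # accept; the real system can retry or replan
    return image

experts = {
    "t2i":  lambda instr, img: f"t2i({instr})",
    "edit": lambda instr, img: f"edit({img}, {instr})",
}
critic = lambda instr, img: len(img)  # toy stand-in for a learned reward model
plan = [("draw a red house", "t2i"), ("add snow on the roof", "edit")]
print(run_pipeline(plan, experts, critic))
```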
- High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance [0.0]
This paper addresses the limitations of existing text-driven image generation methods in semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed that integrates text-image contrastive constraints with structural guidance mechanisms (a standard contrastive loss is sketched below). The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity.
arXiv Detail & Related papers (2025-08-14T02:15:11Z)
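The abstract does not specify the contrastive constraint, so the sketch below uses a standard symmetric InfoNCE loss over matched image-text pairs; the temperature and embedding sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over matched (image, text) pairs: each image should
    embed closest to its own caption and vice versa."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0))    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```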
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm: latent semantics are mined from control images to enrich text prompts, and this enriched semantic understanding then aids image generation without additional complex modifications (the two-stage flow is sketched after this entry).
arXiv Detail & Related papers (2025-06-04T05:56:19Z)
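A minimal sketch of the comprehend-then-generate flow, with `vlm_describe` and `generate_image` as hypothetical stand-ins for a real VLM and a controllable image generator.

```python
# Hypothetical two-stage flow; `vlm_describe` and `generate_image` stand in
# for a real VLM and a controllable image generator.
def vlm_describe(control_image: str) -> str:
    # Stage 1 ("comprehend"): mine latent semantics from the control image.
    return "a cat silhouette facing left, indoor lighting"

def generate_image(prompt: str, control_image: str) -> str:
    # Stage 2 ("generate"): the generator consumes the enriched prompt as-is.
    return f"image(prompt={prompt!r}, control={control_image!r})"

user_prompt = "make it look cozy"
control = "edge_map.png"
enriched = f"{user_prompt}. Scene: {vlm_describe(control)}"
print(generate_image(enriched, control))
```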
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. The architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images (see the sketch below). We find that although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
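A minimal sketch of the mixed-token decoding step this implies: a discrete (categorical) head for text tokens and a continuous head for image tokens, sharing one backbone state. Dimensions are illustrative, and the continuous head is only a stand-in for whatever the model actually uses.

```python
import torch
import torch.nn as nn

# Mixed-token autoregressive step: a discrete head for text tokens and a
# continuous head for image tokens, sharing one backbone state.
hidden, vocab, latent = 64, 1000, 16
text_head = nn.Linear(hidden, vocab)    # categorical logits over the text vocab
image_head = nn.Linear(hidden, latent)  # continuous vector for an image token

def predict_next(h: torch.Tensor, next_is_text: bool) -> torch.Tensor:
    if next_is_text:
        return torch.distributions.Categorical(logits=text_head(h)).sample()
    return image_head(h)

h = torch.randn(1, hidden)  # backbone state at the current position
print(predict_next(h, True).shape, predict_next(h, False).shape)
```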
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation (a toy policy update is sketched after this entry).
Our approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
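The abstract gives no details of the RL objective; the sketch below assumes a combined reward (subject fidelity plus text alignment) fed into a simple baseline-subtracted policy-gradient step. Both scorers, the weights, and the update rule are hypothetical.

```python
import torch

def combined_reward(subject_sim, text_sim, w_subject: float = 0.6, w_text: float = 0.4):
    # Weighted mix of subject fidelity and text alignment; weights are guesses.
    return w_subject * subject_sim + w_text * text_sim

log_probs = torch.randn(8, requires_grad=True)           # per-sample log-likelihoods
rewards = combined_reward(torch.rand(8), torch.rand(8))  # toy scores in [0, 1]
advantage = rewards - rewards.mean()                     # simple mean baseline
loss = -(advantage.detach() * log_probs).mean()          # policy-gradient step
loss.backward()
```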
- Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models [62.603753097900466]
We present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors.
Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder.
Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts (see the sketch below).
arXiv Detail & Related papers (2023-06-16T14:30:41Z)
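A minimal sketch of the zero-shot composition mentioned above: cross-attention outputs computed under different text contexts are linearly combined. Shapes and weights are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v):
    # Standard scaled dot-product cross-attention; q comes from image latents,
    # k/v from a text context's embeddings.
    attn = F.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v

def compose_contexts(q, contexts, weights):
    """Linearly combine cross-attention outputs from different text contexts,
    as in the zero-shot compositional generation described above."""
    return sum(w * cross_attention(q, k, v) for w, (k, v) in zip(weights, contexts))

q = torch.randn(1, 64, 32)                              # image-latent queries
ctx_a = (torch.randn(1, 8, 32), torch.randn(1, 8, 32))  # e.g. "a cat"
ctx_b = (torch.randn(1, 8, 32), torch.randn(1, 8, 32))  # e.g. "wearing a hat"
out = compose_contexts(q, [ctx_a, ctx_b], [0.5, 0.5])   # (1, 64, 32)
```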
- IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning [110.7118381246156]
The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image serves as a referring auxiliary (a toy fusion step is sketched after this entry).
arXiv Detail & Related papers (2022-04-02T07:48:39Z)
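A toy sketch of increment reasoning under stated assumptions: the semantic increment (new instruction minus history) is fused into the source-image representation, which acts as the reference. The fusion operator and dimensions are illustrative, not IR-GAN's exact design.

```python
import torch
import torch.nn as nn

# Fuse the semantic increment (what the new instruction adds) into the
# source-image representation. Sizes and the fusion operator are illustrative.
img_dim, txt_dim = 128, 64
fuse = nn.Sequential(nn.Linear(img_dim + txt_dim, img_dim), nn.Tanh())

def apply_increment(source_repr: torch.Tensor, history_emb: torch.Tensor,
                    instruction_emb: torch.Tensor) -> torch.Tensor:
    semantic_increment = instruction_emb - history_emb  # what the new turn adds
    fused = fuse(torch.cat([source_repr, semantic_increment], dim=-1))
    return source_repr + fused                          # source acts as the reference

target = apply_increment(torch.randn(1, img_dim), torch.randn(1, txt_dim),
                         torch.randn(1, txt_dim))
```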