Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
- URL: http://arxiv.org/abs/2511.16671v1
- Date: Thu, 20 Nov 2025 18:59:52 GMT
- Title: Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
- Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
- Abstract summary: Thinking-while-Generating (TwiG) is the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning, and reinforcement learning.
- Score: 79.31152006811438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
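The interleaving described in the abstract (reason about the next local region, synthesize it, then reflect on what was just produced) can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `think_while_generating`, `TwiGTrace`, `think`, `generate_region`, and `reflect` are hypothetical, regions are stand-in strings rather than image tokens, and the single regeneration pass after a non-"ok" reflection is an assumed way of acting on the reflection.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TwiGTrace:
    """Ordered log of one interleaved run (hypothetical structure)."""
    steps: List[str] = field(default_factory=list)    # think / generate / reflect events
    regions: List[str] = field(default_factory=list)  # finished regions, in order


def think_while_generating(
    prompt: str,
    num_regions: int,
    think: Callable[[str, List[str]], str],
    generate_region: Callable[[str, List[str]], str],
    reflect: Callable[[str, str], str],
) -> TwiGTrace:
    """Interleave textual reasoning with region-by-region generation."""
    trace = TwiGTrace()
    for i in range(num_regions):
        # 1) Textual reasoning that guides the upcoming local region.
        thought = think(prompt, trace.regions)
        trace.steps.append(f"think[{i}]: {thought}")

        # 2) Synthesize the region, conditioned on the thought and prior regions.
        region = generate_region(thought, trace.regions)
        trace.steps.append(f"generate[{i}]")

        # 3) Reflect on the region that was just synthesized.
        critique = reflect(thought, region)
        trace.steps.append(f"reflect[{i}]: {critique}")

        # Assumption: a critique other than "ok" triggers one regeneration.
        if critique != "ok":
            region = generate_region(f"{thought} | fix: {critique}", trace.regions)
            trace.steps.append(f"regenerate[{i}]")

        trace.regions.append(region)
    return trace


# Toy stand-ins for the model calls, used only to make the sketch runnable.
def toy_think(prompt: str, done: List[str]) -> str:
    return f"plan part {len(done) + 1} of '{prompt}'"


def toy_generate(thought: str, done: List[str]) -> str:
    return f"<region for: {thought}>"


def toy_reflect(thought: str, region: str) -> str:
    return "ok"


trace = think_while_generating(
    "a red kite over a beach", 3, toy_think, toy_generate, toy_reflect
)
for step in trace.steps:
    print(step)
```

In a real system the three callables would all be served by one unified multimodal model; keeping them as injected functions here simply makes the control flow visible without committing to any particular model API.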
Related papers
- Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks [24.19752468668527]
Two Interactive Streams (TwInS) is a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is equipped with a tailored semi-supervised training strategy.
arXiv Detail & Related papers (2026-02-14T04:11:19Z)
- Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration [13.00602873238112]
We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. We use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs.
arXiv Detail & Related papers (2026-02-09T13:00:16Z)
- FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback. We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z)
- Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
- Image Content Generation with Causal Reasoning [17.89980837508069]
ChatGPT has once again sparked research in generative artificial intelligence (GAI).
In visual modality, there is currently no equivalent research.
We propose a new image generation task called visual question answering with image (VQAI).
arXiv Detail & Related papers (2023-12-12T10:07:16Z)
- Momentum Decoding: Open-ended Text Generation As Graph Exploration [49.812280360794894]
Open-ended text generation with autoregressive language models (LMs) is one of the core tasks in natural language processing.
We formulate open-ended text generation from a new perspective, i.e., we view it as an exploration process within a directed graph.
We propose a novel decoding method, momentum decoding, which encourages the LM to explore new nodes outside the current graph.
arXiv Detail & Related papers (2022-12-05T11:16:47Z)
- Visualize Before You Write: Imagination-Guided Open-Ended Text Generation [68.96699389728964]
We propose iNLG that uses machine-generated images to guide language models in open-ended text generation.
Experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks.
arXiv Detail & Related papers (2022-10-07T18:01:09Z)
- Survey of Hallucination in Natural Language Generation [69.9926849848132]
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies.
Deep learning based generation is prone to hallucinate unintended text, which degrades the system performance.
This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
arXiv Detail & Related papers (2022-02-08T03:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.