Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
- URL: http://arxiv.org/abs/2510.05593v1
- Date: Tue, 07 Oct 2025 05:40:43 GMT
- Title: Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
- Authors: Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang,
- Abstract summary: We introduce ShortCoTI, a lightweight optimization framework for image generation.<n>ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task.<n>Our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich.
- Score: 55.57836819892392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
Related papers
- ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models [33.09645476860831]
We propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation.<n>Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process.<n>Experiments show that our approach produces high-fidelity images without compromising latency compared to existing methods.
arXiv Detail & Related papers (2026-02-13T05:59:57Z) - Thinking with Images via Self-Calling Agent [43.48244527974193]
Self-Calling Chain-of-Thought (sCoT) is a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling.<n> Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9%$ with $sim 75%$ fewer GPU hours.
arXiv Detail & Related papers (2025-12-09T11:53:21Z) - HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm.<n>It decomposes complex prompts into minimal semantic units.<n>It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next.<n>Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
arXiv Detail & Related papers (2025-11-25T06:24:25Z) - Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets [19.950913420708734]
We propose a training-free approach that clusters prompts based on semantic similarity and shares in early diffusion steps.<n>Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation.
arXiv Detail & Related papers (2025-08-28T17:35:03Z) - Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration [109.38288333994407]
Contrastive Prompt Learning (CPL) is a novel framework that fundamentally enhances prompt-task alignment.<n>Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
arXiv Detail & Related papers (2025-04-14T08:24:57Z) - NFIG: Autoregressive Image Generation with Next-Frequency Prediction [50.69346038028673]
We present textbfNext-textbfFrequency textbfImage textbfGeneration (textbfNFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages.<n>Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images.
arXiv Detail & Related papers (2025-03-10T08:59:10Z) - Striving for Faster and Better: A One-Layer Architecture with Auto Re-parameterization for Low-Light Image Enhancement [50.93686436282772]
We aim to delve into the limits of image enhancers both from visual quality and computational efficiency.<n>By rethinking the task demands, we build an explicit connection, i.e., visual quality and computational efficiency are corresponding to model learning and structure design.<n>Ultimately, this achieves efficient low-light image enhancement using only a single convolutional layer, while maintaining excellent visual quality.
arXiv Detail & Related papers (2025-02-27T08:20:03Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.<n>We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.<n>We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple, yet effective approach based on adaptively adjusting the per-token prompt weights.<n>We show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets.<n>We also explore combining FRAP with prompt rewriting LLM to recover their degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z) - Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning [45.517215214938844]
Chain-of-thought technique has been received well in multi-modal tasks.
We propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning.
arXiv Detail & Related papers (2024-04-06T07:39:44Z) - Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.