CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
- URL: http://arxiv.org/abs/2512.20362v1
- Date: Tue, 23 Dec 2025 13:44:41 GMT
- Title: CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
- Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin,
- Abstract summary: CRAFT (Continuous Reasoning and Agentic Feedback Tuning) is a training-free, model-agnostic framework that brings a structured reasoning paradigm to multimodal image generation. It consistently improves compositional accuracy, text rendering, and preference-based evaluations. These improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
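The abstract describes a concrete refinement loop: decompose, generate, verify, edit, and stop once all constraints pass. The following Python is a minimal sketch reconstructed from that description alone; the `VisualQuestion` structure, the helper callables (`decompose`, `generate`, `verify`, `edit_prompt`), and the dependency-skipping logic are illustrative assumptions, not the authors' actual interface.

```python
# Illustrative sketch of a CRAFT-style inference-time refinement loop.
# Every name below is a placeholder assumption inferred from the abstract.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class VisualQuestion:
    question: str  # e.g. "Is there a red cube to the left of the sphere?"
    depends_on: List[int] = field(default_factory=list)  # prerequisite question indices


def craft_refine(
    prompt: str,
    decompose: Callable[[str], List[VisualQuestion]],  # LLM: prompt -> structured questions
    generate: Callable[[str], Any],                    # any text-to-image backbone
    verify: Callable[[Any, str], bool],                # VLM answers one visual question
    edit_prompt: Callable[[str, List[str]], str],      # LLM agent: targeted prompt edit
    max_rounds: int = 5,
) -> Any:
    """Regenerate until every visual constraint is satisfied or the budget runs out."""
    questions = decompose(prompt)
    image = None
    for _ in range(max_rounds):
        image = generate(prompt)
        passed: Dict[int, bool] = {}
        failed: List[str] = []
        for i, q in enumerate(questions):
            # Respect the dependency structure: skip questions whose prerequisites failed.
            if not all(passed.get(j, False) for j in q.depends_on):
                continue
            passed[i] = verify(image, q.question)
            if not passed[i]:
                failed.append(q.question)
        if not failed:
            return image  # explicit stopping criterion: all constraints satisfied
        prompt = edit_prompt(prompt, failed)  # edit only where constraints fail
    return image
```

Two design points from the abstract are visible in the sketch: the dependency ordering avoids spending VLM calls on questions whose prerequisites already failed, and restricting edits to the failed constraints is what keeps the loop interpretable and gives it a well-defined stopping condition.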
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z)
- Beyond Output Critique: Self-Correction via Task Distillation [36.44752912823049]
We propose a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task.
arXiv Detail & Related papers (2026-01-31T19:15:41Z)
- LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models [4.497411606350301]
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition. We propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL.
arXiv Detail & Related papers (2026-01-14T03:32:55Z)
- ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing [33.888289858260706]
Reinforcement learning (RL) has been investigated for improving the quality of image editing. RL faces three key challenges: (1) limited reasoning exploration confined to denoising, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. We propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis.
arXiv Detail & Related papers (2026-01-06T23:43:00Z)
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment [105.31858867473845]
ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing. In experiments, ImageCritic effectively resolves detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
arXiv Detail & Related papers (2025-11-25T18:40:25Z)
- ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation [49.01601313084479]
ImAgent is a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation. Experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone.
arXiv Detail & Related papers (2025-11-14T17:00:29Z)
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Autoregressive Image Generation with Vision Full-view Prompt [18.569610688433745]
Inspired by prompt engineering from the field of NLP, we propose the Vision Full-view prompt (VF prompt) to enhance autoregressive image generation.
arXiv Detail & Related papers (2025-02-24T08:44:01Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO [36.69910114305134]
We propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO) to enhance preference modeling. ISR-DPO enhances the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations, ISR-DPO significantly outperforms the state of the art.
arXiv Detail & Related papers (2024-06-17T07:33:30Z)