CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
- URL: http://arxiv.org/abs/2601.10061v1
- Date: Thu, 15 Jan 2026 04:33:06 GMT
- Title: CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
- Authors: Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang, Zhizheng Zhao, Ruichuan An, Bohan Zeng, Yang Shi, Yifan Dai, Ziming Zhao, Guanbin Li, Pengfei Wan, Yuanxing Zhang, Wentao Zhang
- Abstract summary: Chain-of-Frame (CoF) reasoning enables frame-by-frame visual inference. CoF-T2I integrates CoF reasoning into text-to-image (T2I) generation via progressive visual refinement. Experiments show that CoF-T2I significantly outperforms the base video model.
- Score: 52.0601996237501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, which enables frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored, owing to the absence of a clearly defined visual-reasoning starting point and of interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
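Since no code accompanies this listing, the following Python sketch illustrates the inference flow the abstract describes: a text-conditioned video model produces a short Chain-of-Frame trajectory, each latent frame is decoded independently (the abstract's per-frame encoding change), and only the final frame is returned as the T2I output. All function names, signatures, and tensor shapes here are assumptions made for illustration, not the authors' implementation.

```python
import torch

# Stand-ins for CoF-T2I's components; names, signatures, and shapes are
# assumptions, not the authors' released API.
def generate_cof_trajectory(prompt: str, num_frames: int = 8) -> torch.Tensor:
    """Stub video backbone: returns latent frames (num_frames, C, H, W) whose
    content would progress from coarse semantics to refined aesthetics."""
    return torch.randn(num_frames, 4, 64, 64)

def decode_frame_independently(latent: torch.Tensor) -> torch.Tensor:
    """Stub VAE decoder applied per frame. Decoding each frame on its own,
    rather than through a shared temporal encoding, is the abstract's fix
    for motion artifacts leaking into the final image."""
    return latent.clamp(-1, 1)  # a real decoder would map latents to RGB pixels

def cof_t2i(prompt: str, num_frames: int = 8) -> torch.Tensor:
    """Chain-of-Frame T2I: intermediate frames are explicit reasoning steps;
    only the final frame is kept as the generated image."""
    latents = generate_cof_trajectory(prompt, num_frames)
    frames = [decode_frame_independently(latents[i]) for i in range(num_frames)]
    return frames[-1]

image = cof_t2i("a red cube stacked on a blue sphere")
print(image.shape)  # torch.Size([4, 64, 64]) for these stub shapes
```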
Related papers
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z) - Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models [76.7535001311919]
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they often fail to compose complex scenes or follow logical temporal instructions. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing text-to-video generation into three specialized stages. Our approach sets a new state of the art on the T2V CompBench benchmark and significantly improves all tested models on VBench2.
arXiv Detail & Related papers (2025-12-18T10:10:45Z) - Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis [14.980220974022982]
We introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames. We also employ T2V backbones to ensure consistent motion dynamics. A minimal sketch of this composition appears after this list.
arXiv Detail & Related papers (2025-07-18T08:59:02Z) - Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction [36.82594554832902]
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames. We propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition.
arXiv Detail & Related papers (2025-03-17T09:06:21Z) - Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models [52.84391764467939]
Unified generative models have shown remarkable performance in text and image generation. We introduce Chain of Thought (CoT) into unified generative models to address the challenges of complex image generation. Experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks.
arXiv Detail & Related papers (2025-03-03T08:36:16Z) - STIV: Scalable Text and Image Conditioned Video Generation [82.6516473906985]
We present a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates the image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning. STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation. A minimal sketch of frame replacement appears after this list.
arXiv Detail & Related papers (2024-12-10T18:27:06Z) - LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z) - LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)
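For the EVS entry above, a minimal training-free composition loop might look like the following sketch. The two stubbed models and the `strength` parameter are assumptions, since the abstract only states that a T2I model refines low-quality T2V frames while T2V backbones keep motion consistent.

```python
import torch

# Assumed interfaces for the two pretrained models EVS composes; the real
# pipeline is not specified in the abstract.
def t2v_generate(prompt: str, num_frames: int = 16) -> torch.Tensor:
    """Stub T2V backbone: temporally coherent but low-fidelity frames."""
    return torch.rand(num_frames, 3, 256, 256)

def t2i_refine(frame: torch.Tensor, prompt: str, strength: float = 0.3) -> torch.Tensor:
    """Stub img2img-style refinement: a real T2I diffusion model would
    partially noise the frame and re-denoise it; a low strength preserves
    the T2V model's layout and motion while sharpening detail."""
    return frame

def evs(prompt: str) -> torch.Tensor:
    frames = t2v_generate(prompt)                      # motion from the T2V model
    refined = [t2i_refine(f, prompt) for f in frames]  # fidelity from the T2I model
    return torch.stack(refined)

video = evs("a corgi surfing at sunset")
```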
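And for the STIV entry, frame replacement can be sketched as pinning the conditioning image's clean latent into the first frame slot before each denoising step. The tensor layout and the toy `denoise_step` are assumptions, not STIV's actual implementation.

```python
import torch

def frame_replacement_step(noisy_latents: torch.Tensor,
                           cond_latent: torch.Tensor,
                           denoise_step) -> torch.Tensor:
    """One denoising step with frame replacement: the first frame's latent is
    overwritten with the clean encoded condition image, so the DiT sees a
    fixed visual condition while jointly denoising the remaining frames."""
    pinned = noisy_latents.clone()
    pinned[:, 0] = cond_latent  # pin frame 0 to the condition image
    return denoise_step(pinned)

latents = torch.randn(1, 8, 4, 32, 32)  # (batch, frames, channels, H, W) -- assumed layout
cond = torch.randn(1, 4, 32, 32)        # VAE-encoded condition image (stub)
out = frame_replacement_step(latents, cond, denoise_step=lambda x: 0.9 * x)  # toy denoiser
```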