Self-Evaluation Unlocks Any-Step Text-to-Image Generation
- URL: http://arxiv.org/abs/2512.22374v1
- Date: Fri, 26 Dec 2025 20:42:11 GMT
- Title: Self-Evaluation Unlocks Any-Step Text-to-Image Generation
- Authors: Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
- Abstract summary: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism. Experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps.
- Score: 65.7088507945307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
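The abstract does not spell out the training objective, but the Flow Matching side it builds on is standard: regress a velocity field toward the straight-line displacement between a noise sample and a data sample. The toy sketch below (NumPy, not the paper's code) shows that interpolant and why an ideal straight velocity field makes Euler sampling exact at *any* step count, which is the property Self-E aims to recover in a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_pair(x0, x1, t):
    """Flow Matching interpolant and regression target:
    x_t = (1 - t) * x0 + t * x1, target velocity v* = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def euler_sample(v_fn, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_fn(x, t)
    return x

# Toy check: with the oracle velocity v*(x, t) = x1 - x0 (constant along the
# path), any number of Euler steps lands exactly on x1 -- the any-step ideal.
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
for steps in (1, 4, 50):
    assert np.allclose(euler_sample(lambda x, t: x1 - x0, x0, steps), x1)
```

A trained model's velocity field is only locally supervised and generally curved, which is why plain Flow Matching needs many steps; the abstract's self-evaluation term is the paper's mechanism for closing that gap, and its exact form is not given here.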
Related papers
- Self-Improving LLM Agents at Test-Time [49.9396634315896]
One paradigm of language model (LM) fine-tuning relies on creating large training datasets. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive. We study two variants of this approach: Test-Time Self-Improvement (TT-SI) and Test-Time Distillation (TT-D).
arXiv Detail & Related papers (2025-10-09T06:37:35Z)
- Self-evolved Imitation Learning in Simulated World [16.459715139048367]
Self-Evolved Imitation Learning (SEIL) is a framework that progressively improves a few-shot model through simulator interactions. SEIL achieves a new state-of-the-art performance in few-shot imitation learning scenarios.
arXiv Detail & Related papers (2025-09-23T18:15:32Z)
- Align Your Flow: Scaling Continuous-Time Flow Map Distillation [63.927438959502226]
Flow maps connect any two noise levels in a single step and remain effective across all step counts. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks. We show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.
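The defining property of a flow map is that it jumps from one noise level directly to another in a single evaluation. The toy below illustrates this with the closed-form map of a constant velocity field; a distilled model like Align Your Flow learns such a map from a teacher, and its real map has no simple closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)

def flow_map(x, t, s):
    """Flow map for the toy constant velocity field v = x1 - x0:
    moves a sample from noise level t to level s in one evaluation."""
    return x + (s - t) * (x1 - x0)

# One jump from t=0 to t=1 equals the full ODE solution...
assert np.allclose(flow_map(x0, 0.0, 1.0), x1)

# ...and chaining jumps through any intermediate level reaches the same
# point, which is why flow maps stay effective across all step counts.
mid = flow_map(x0, 0.0, 0.3)
assert np.allclose(flow_map(mid, 0.3, 1.0), x1)
```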
arXiv Detail & Related papers (2025-06-17T15:06:07Z)
- Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation [3.8959351616076745]
Flow matching has emerged as a promising framework for training generative models. We introduce a self-corrected flow distillation method that integrates consistency models and adversarial training. This work is a pioneer in achieving consistent generation quality in both few-step and one-step sampling.
arXiv Detail & Related papers (2024-12-22T07:48:49Z)
- OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs [24.046764908874703]
OFTSR is a flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. We demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off.
arXiv Detail & Related papers (2024-12-12T17:14:58Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
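The core idea of sharpening, using the model as its own verifier, can be sketched as a best-of-n loop. The `generate` and `score` interfaces below are hypothetical stand-ins; the paper analyzes SFT- and RLHF-based variants rather than this exact procedure.

```python
from itertools import cycle

def self_sharpen(generate, score, prompt, n=8):
    """Best-of-n 'sharpening' sketch: the model proposes n candidates,
    then its own scorer (acting as verifier) returns the one it rates
    highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy stand-in: a "model" whose generations are noisy but whose verifier
# reliably recognizes the correct answer -- the asymmetry sharpening exploits.
answers = cycle(["5", "four", "4", "22"])
generate = lambda prompt: next(answers)
score = lambda prompt, cand: 1.0 if cand == "4" else 0.0

assert self_sharpen(generate, score, "2+2?", n=4) == "4"
```

The loop turns verification ability into generation quality: the sharper the verifier relative to the generator, the more the selected output improves over a single sample.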
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Class-Conditional self-reward mechanism for improved Text-to-Image models [1.8434042562191815]
We build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models.
This approach works by fine-tuning diffusion model on a self-generated self-judged dataset.
Evaluations show it to be at least 60% better than existing commercial and research text-to-image models.
arXiv Detail & Related papers (2024-05-22T09:28:43Z)
- One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z)
- SelfEval: Leveraging the discriminative nature of generative models for evaluation [30.239717220862143]
We present an automated way to evaluate the text alignment of text-to-image generative diffusion models. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
arXiv Detail & Related papers (2023-11-17T18:58:16Z)
- Learning Rich Nearest Neighbor Representations from Self-supervised Ensembles [60.97922557957857]
We provide a framework to perform self-supervised model ensembling via a novel method of learning representations directly through gradient descent at inference time.
This technique improves representation quality, as measured by k-nearest neighbors, both on the in-domain dataset and in the transfer setting.
arXiv Detail & Related papers (2021-10-19T22:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.