Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
- URL: http://arxiv.org/abs/2511.20821v1
- Date: Tue, 25 Nov 2025 20:20:21 GMT
- Title: Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
- Authors: Samuele Dell'Erba, Andrew D. Bagdanov
- Abstract summary: Optimization-based Visual Inversion (OVI) is a training-free and data-free alternative to a trained diffusion prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the input text prompt embedding. Experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors.
- Score: 11.905134977931075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
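The optimization loop described in the abstract can be sketched as follows. This is an illustrative reconstruction in PyTorch, not the authors' code: the function name, the use of Adam, the step count, and the precomputed distribution statistics `mu` and `cov_inv` (mean and inverse covariance of image embeddings, assumed available) are all assumptions. It shows the core idea: start from a random latent, maximize cosine similarity with the text embedding, and regularize with a Mahalanobis penalty toward the image-embedding distribution.

```python
import torch


def ovi_optimize(text_emb, mu, cov_inv, steps=200, lr=0.1, lam=0.01):
    """Training-free sketch of Optimization-based Visual Inversion (OVI).

    text_emb: (B, D) text prompt embeddings to align with.
    mu, cov_inv: (D,) mean and (D, D) inverse covariance of the
        image-embedding distribution (hypothetical precomputed stats).
    Returns an optimized (B, D) latent visual embedding.
    """
    # Initialize the latent from random "pseudo-tokens".
    z = torch.randn_like(text_emb, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Alignment term: maximize cosine similarity with the text embedding.
        cos = torch.nn.functional.cosine_similarity(z, text_emb, dim=-1).mean()
        # Mahalanobis constraint: keep z close to the image-embedding
        # distribution (squared Mahalanobis distance to its mean).
        diff = z - mu
        maha = (diff @ cov_inv * diff).sum(dim=-1).mean()
        loss = -cos + lam * maha
        loss.backward()
        opt.step()
    return z.detach()
```

The Nearest-Neighbor variant mentioned in the abstract would swap the Mahalanobis term for a distance to the closest embedding in a bank of real image embeddings; the alignment term stays the same.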
Related papers
- TextBFGS: Quasi-Newton Optimization for Discrete Executable Text via Gradient-Operator Retrieval [38.3962427355446]
We introduce TextBFGS, a second-order framework that implements a quasi-Newton optimization method for discrete text. TextBFGS approximates the inverse Hessian matrix by retrieving gradient operators from a memory of pre-learned successful trajectories. It achieves superior pass rates with fewer model calls and exhibits strong cross-task transferability.
arXiv Detail & Related papers (2026-01-20T05:45:56Z) - Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference [2.8292841621378844]
We propose an adaptive visual preprocessing method that adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count.
arXiv Detail & Related papers (2025-12-23T23:30:56Z) - Towards Scalable Pre-training of Visual Tokenizers for Generation [41.785568766118594]
We present a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods.
arXiv Detail & Related papers (2025-12-15T18:59:54Z) - RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z) - Infusing fine-grained visual knowledge to Vision-Language Models [5.487134463783365]
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs). We propose a fine-tuning method explicitly designed to achieve an optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Our approach consistently achieves strong results, notably retaining visual-text alignment without utilizing any text data or the original text encoder during fine-tuning.
arXiv Detail & Related papers (2025-08-16T19:12:09Z) - How to Use Diffusion Priors under Sparse Views? [29.738350228085928]
Inline Prior Guided Score Matching is proposed to provide visual supervision over sparse views in 3D reconstruction. We show that our method achieves state-of-the-art reconstruction quality.
arXiv Detail & Related papers (2024-12-03T07:31:54Z) - PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation [10.856377349228927]
We argue that a language prior can enhance monocular depth estimation by leveraging the inductive bias learned during the text-to-image pre-training of diffusion models. We propose PriorDiffusion, using a pre-trained text-to-image diffusion model that takes both images and corresponding text descriptions to infer affine-invariant depth.
arXiv Detail & Related papers (2024-11-24T05:07:10Z) - Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings [61.9257731511557]
We propose Text Guided LLaVA (TG-LLaVA) to optimize vision-language models (VLMs).
We use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance.
With the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question.
arXiv Detail & Related papers (2024-09-15T00:38:34Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework.
arXiv Detail & Related papers (2023-07-08T06:49:54Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - VA-DepthNet: A Variational Approach to Single Image Depth Prediction [163.14849753700682]
VA-DepthNet is a simple, effective, and accurate deep neural network approach for the single-image depth prediction problem.
The paper demonstrates the usefulness of the proposed approach via extensive evaluation and ablation analysis over several benchmark datasets.
arXiv Detail & Related papers (2023-02-13T17:55:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.