Implicit Inversion turns CLIP into a Decoder
- URL: http://arxiv.org/abs/2505.23161v2
- Date: Wed, 04 Jun 2025 10:30:14 GMT
- Title: Implicit Inversion turns CLIP into a Decoder
- Authors: Antonio D'Orazio, Maria Rosaria Briglia, Donato Crisostomi, Dario Loi, Emanuele Rodolà, Iacopo Masi
- Abstract summary: We show that image synthesis is possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction.
- Score: 15.428694454730541
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
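Reading the abstract as an optimization recipe, the core loop is easy to sketch: freeze CLIP, parameterize the image as a coordinate network, and ascend the image-text similarity. Below is a minimal, hedged sketch in PyTorch, assuming the open_clip package; the Fourier-feature MLP, the prompt, the resolution, and all hyperparameters are illustrative stand-ins, and the paper's key stabilizers (layer-wise frequency stratification, adversarially robust initialization, the Orthogonal Procrustes projection, and the blending loss) are omitted.

```python
import math
import torch
import torch.nn as nn
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP scores the rendered image against the prompt.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval().requires_grad_(False)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

RES = 224  # render directly at CLIP's input resolution to avoid resizing
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

class FourierINR(nn.Module):
    """Coordinate MLP with random Fourier features: (x, y) -> RGB."""
    def __init__(self, n_freqs=64, hidden=256):
        super().__init__()
        self.register_buffer("B", torch.randn(2, n_freqs) * 10.0)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):                    # coords: (N, 2) in [-1, 1]
        proj = 2 * math.pi * coords @ self.B
        return self.mlp(torch.cat([proj.sin(), proj.cos()], dim=-1))

inr = FourierINR().to(device)
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, RES, device=device),
                        torch.linspace(-1, 1, RES, device=device), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

with torch.no_grad():
    txt = model.encode_text(tokenizer(["a watercolor painting of a fox"]).to(device))
    txt = txt / txt.norm(dim=-1, keepdim=True)

for step in range(500):
    img = inr(coords).reshape(RES, RES, 3).permute(2, 0, 1).unsqueeze(0)
    emb = model.encode_image((img - MEAN) / STD)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (emb * txt).sum()                # maximize cosine similarity
    opt.zero_grad(); loss.backward(); opt.step()
```

For the Procrustes step mentioned in the abstract, `scipy.linalg.orthogonal_procrustes` computes the optimal rotation between paired embedding sets, though how the paper applies it locally is not detailed in the abstract.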
Related papers
- Robustness in Both Domains: CLIP Needs a Robust Text Encoder [65.42617172921975]
LEAF is an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models.
Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders.
We show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
arXiv Detail & Related papers (2025-06-03T19:57:09Z)
- Spectral Image Tokenizer [21.84385276311364]
Image tokenizers map images to sequences of discrete tokens.
We propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT).
We evaluate the tokenizer on multiscale image generation, text-guided image upsampling, and editing.
arXiv Detail & Related papers (2024-12-12T18:59:31Z)
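The DWT idea above is easy to demo: a wavelet decomposition already orders image content coarse-to-fine, which is what a spectrum tokenizer can exploit. A small sketch using PyWavelets (an assumption; the paper's actual tokenizer, quantizer, and generation model are not shown), with a random array standing in for an image:

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)  # stand-in grayscale image

# Two-level 2D DWT: one coarse approximation plus detail bands per scale.
coeffs = pywt.wavedec2(img, "haar", level=2)
print("approximation:", coeffs[0].shape)          # (64, 64) coarse band
for bands in coeffs[1:]:                          # coarsest details first
    h, v, d = bands                               # horizontal/vertical/diagonal
    print("detail bands:", h.shape, v.shape, d.shape)

# The transform is exactly invertible, so tokens built from these bands can
# in principle be decoded back without learning the transform itself.
recon = pywt.waverec2(coeffs, "haar")
print("max reconstruction error:", np.abs(recon - img).max())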
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighboring texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
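The CODER construction can be sketched with plain linear algebra: represent each image not by its raw CLIP feature but by its similarity profile against a bank of texts, then classify in that profile space. The sketch below uses random arrays as stand-ins for CLIP embeddings (in practice they would come from CLIP's encoders, and the text bank from generated captions):

```python
import numpy as np

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(10, 512))     # 10 images (stand-in CLIP features)
text_bank = rng.normal(size=(200, 512))  # 200 descriptive texts
class_emb = rng.normal(size=(5, 512))    # 5 class-name text embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb, text_bank, class_emb = map(normalize, (img_emb, text_bank, class_emb))

# Neighbor representation: each row is an image's similarity profile over texts.
img_coder = normalize(img_emb @ text_bank.T)      # (10, 200)
class_coder = normalize(class_emb @ text_bank.T)  # (5, 200)

# Classify by nearest class in the neighbor space rather than raw CLIP space.
pred = (img_coder @ class_coder.T).argmax(axis=1)
print(pred)
```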
- In-Domain GAN Inversion for Faithful Reconstruction and Editability [132.68255553099834]
We propose in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer to keep the inverted code within the native latent space of the pre-trained GAN model.
We provide comprehensive analyses of the effects of the encoder structure, the starting inversion point, and the inversion parameter space, and observe a trade-off between reconstruction quality and editability.
arXiv Detail & Related papers (2023-09-25T08:42:06Z)
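The inversion loop itself is compact: freeze the generator, initialize the code (in the paper, via the domain-guided encoder), and optimize it under a reconstruction loss plus a regularizer that keeps it in-domain. The sketch below substitutes a tiny random MLP for the pre-trained GAN and an all-zeros code for the encoder output, so it only illustrates the optimization structure:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in generator; in practice this is a pre-trained GAN, kept frozen.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32), nn.Tanh())
G.requires_grad_(False)

target = torch.rand(3 * 32 * 32) * 2 - 1  # image to invert (stand-in)
z_init = torch.zeros(64)                  # stand-in for the encoder's initial code
z = z_init.clone().requires_grad_(True)
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(300):
    recon = G(z)
    # Reconstruction term plus a pull toward the initialization, a simple
    # proxy for keeping the code in-domain.
    loss = (recon - target).pow(2).mean() + 1e-3 * (z - z_init).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```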
- CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, that leverages the CLIP model and StyleGAN.
The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
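The bridging idea reduces to learning a small mapper between two fixed embedding spaces. A hedged sketch with random stand-in data (real training pairs would come from encoding StyleGAN samples with CLIP; the mapper architecture and loss here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
clip_feats = torch.randn(1024, 512)  # stand-in CLIP embeddings of GAN samples
latents = torch.randn(1024, 128)     # stand-in latents that produced them

# Small MLP bridging CLIP's embedding space to the GAN's latent space.
mapper = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

for epoch in range(100):
    loss = (mapper(clip_feats) - latents).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, a text prompt's CLIP embedding can be mapped to a latent and
# decoded by the frozen GAN, enabling text-driven generation and editing.
```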
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [33.43993665841577]
We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
arXiv Detail & Related papers (2021-12-09T18:59:55Z)
- Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z)
- Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation [181.08127307338654]
This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images.
The deep generative prior (DGP) provides compelling results in restoring missing semantics (e.g., color, missing patches, resolution) of various degraded images.
arXiv Detail & Related papers (2020-03-30T17:45:07Z)
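The DGP recipe can be illustrated in a few lines: keep a generator as a frozen prior and search its latent space for an image whose degraded version matches the observation (the paper additionally fine-tunes the generator itself, which this sketch omits). A tiny random MLP stands in for the pre-trained GAN, and grayscale conversion stands in for the known degradation, giving a toy colorization task:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a pre-trained GAN generator, kept frozen as the image prior.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32), nn.Tanh())
G.requires_grad_(False)

def degrade(x):                           # known degradation: RGB -> grayscale
    return x.view(3, 32, 32).mean(dim=0)

observed = degrade(torch.rand(3 * 32 * 32) * 2 - 1)  # grayscale observation
z = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(300):
    # Match the observation only through the degradation operator; the prior
    # fills in the missing semantics (here, color).
    loss = (degrade(G(z)) - observed).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

restored = G(z).detach()                  # full RGB image consistent with the prior
```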
This list is automatically generated from the titles and abstracts of the papers in this site.