Related papers: Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

URL: http://arxiv.org/abs/2510.14630v1
Date: Thu, 16 Oct 2025 12:43:03 GMT
Title: Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Authors: Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
Abstract summary: RepTok is a generative modeling framework that represents an image using a single continuous latent token. RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis.
Score: 18.746963205066688
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
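The training signal described in the abstract combines a flow matching objective for the decoder with a cosine-similarity regularizer on the adapted token. The following is a minimal sketch of how such a combined loss could be computed, not the authors' implementation. It assumes a linear interpolation path between noise and image (one common flow matching choice), a regularizer of the form 1 - cos(adapted, original), and a weighting factor; the names `predict_velocity`, `frozen_token`, and `reg_weight` are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_regularizer(adapted_token, frozen_token):
    """Assumed form of the regularizer: 0 when the adapted token points in
    the same direction as the original SSL token, larger as it drifts."""
    return 1.0 - cosine_similarity(adapted_token, frozen_token)


def flow_matching_loss(predict_velocity, noise, image, t, token):
    """Mean squared error between predicted and target velocity.

    Assumes the linear path x_t = (1 - t) * noise + t * image, whose
    target velocity is image - noise. `predict_velocity` stands in for
    the generative decoder conditioned on the latent token.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    target = [x - n for n, x in zip(noise, image)]
    pred = predict_velocity(x_t, t, token)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def total_loss(predict_velocity, noise, image, t,
               adapted_token, frozen_token, reg_weight=1.0):
    """Flow matching loss plus the weighted cosine regularizer.
    `reg_weight` is a hypothetical hyperparameter."""
    fm = flow_matching_loss(predict_velocity, noise, image, t, adapted_token)
    reg = cosine_regularizer(adapted_token, frozen_token)
    return fm + reg_weight * reg
```

In this sketch the regularizer only constrains the direction of the adapted token relative to the original SSL token, which is one way to read the abstract's goal of preserving the geometry of the SSL space while the token absorbs reconstruction detail.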

Related papers

Learning Sparse Visual Representations via Spatial-Semantic Factorization [37.169502692169196]
Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. We introduce STELLAR, a framework that factorizes visual features into a low-rank product of semantic concepts and their spatial distributions. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy).
arXiv Detail & Related papers (2026-02-02T10:12:17Z)
Multi-Scale Local Speculative Decoding for Image Generation [10.239314110594249]
We introduce Multi-Scale Local Speculative Decoding (MuLo-SD). MuLo-SD combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. We demonstrate that MuLo-SD achieves substantial speedups of up to $\mathbf{1.7\times}$.
arXiv Detail & Related papers (2026-01-08T17:39:35Z)
ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z)
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing [62.94394079771687]
A burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. We propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We show that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both Text-to-Image (T2I) and image editing tasks.
arXiv Detail & Related papers (2025-12-19T18:59:57Z)
SFTok: Bridging the Performance Gap in Discrete Tokenizers [72.9996757048065]
We propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet.
arXiv Detail & Related papers (2025-12-18T18:59:04Z)
Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter [52.93785843453579]
Blind face restoration from low-quality (LQ) images is a challenging task that requires high-fidelity image reconstruction and the preservation of facial identity. We propose LAFR, a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts. We show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-05-29T14:11:16Z)
"Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing [33.69721994194684]
We propose a novel framework for open-vocabulary semantic segmentation called EBSeg. The AdaB Decoder is designed to generate different image embeddings for both training and new classes. The SSC Loss aligns the inter-class affinity in the image feature space with that in the text feature space of CLIP.
arXiv Detail & Related papers (2024-06-14T08:34:20Z)
A Compact and Semantic Latent Space for Disentangled and Controllable Image Editing [4.8201607588546]
We propose an auto-encoder which re-organizes the latent space of StyleGAN, so that each attribute which we wish to edit corresponds to an axis of the new latent space. We show that our approach has greater disentanglement than competing methods, while maintaining fidelity to the original image with respect to identity.
arXiv Detail & Related papers (2023-12-13T16:18:45Z)
DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency. The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on. Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
Momentum Contrastive Autoencoder: Using Contrastive Learning for Latent Space Distribution Matching in WAE [51.09507030387935]
Wasserstein autoencoder (WAE) shows that matching two distributions is equivalent to minimizing a simple autoencoder (AE) loss under the constraint that the latent space of this AE matches a pre-specified prior distribution. We propose to use the contrastive learning framework that has been shown to be effective for self-supervised representation learning, as a means to resolve this problem. We show that using the contrastive learning framework to optimize the WAE loss achieves faster convergence and more stable optimization compared with existing popular algorithms for WAE.
arXiv Detail & Related papers (2021-10-19T22:55:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.