Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
- URL: http://arxiv.org/abs/2510.18457v2
- Date: Mon, 03 Nov 2025 03:48:04 GMT
- Title: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
- Authors: Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng
- Abstract summary: The Vision Foundation Model Variational Autoencoder (VFM-VAE) is designed to resolve the inherent tension between a VFM's semantic focus and the need for pixel-level fidelity. Our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers).
- Score: 45.63522160275318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
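The decoder redesign is the core architectural claim here. Below is a minimal, hedged PyTorch sketch of the general idea: reconstructing pixels from a spatially coarse VFM latent by fusing multi-scale projections of it while progressively doubling resolution. All module names, channel widths, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): decode high-resolution pixels from
# a spatially coarse VFM feature map by re-injecting multi-scale projections
# of the latent while progressively upsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveBlock(nn.Module):
    """Doubles spatial resolution, then refines with a small conv stack."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.conv(self.up(x))

class VFMDecoder(nn.Module):
    """Fuses multi-scale projections of a coarse VFM latent, then decodes."""
    def __init__(self, latent_dim=768, base_ch=256, num_stages=4):
        super().__init__()
        chans = [base_ch // (2 ** i) for i in range(num_stages)]
        # One 1x1 projection of the same coarse latent per stage (latent fusion).
        self.proj = nn.ModuleList(
            nn.Conv2d(latent_dim, c, 1) for c in chans
        )
        self.stages = nn.ModuleList(
            ProgressiveBlock(chans[i], chans[i + 1]) for i in range(num_stages - 1)
        )
        self.to_rgb = nn.Conv2d(chans[-1], 3, 3, padding=1)

    def forward(self, z):  # z: (B, latent_dim, 16, 16) coarse VFM feature map
        x = self.proj[0](z)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            # Re-inject the latent at the new scale so semantics persist.
            skip = F.interpolate(self.proj[i + 1](z), size=x.shape[-2:])
            x = x + skip
        return self.to_rgb(x)

decoder = VFMDecoder()
img = decoder(torch.randn(2, 768, 16, 16))
print(img.shape)  # torch.Size([2, 3, 128, 128]): 16x16 latent, 3 doublings
```

The design point this sketch tries to capture: because VFM features are semantically rich but spatially coarse, nearly all decoder capacity goes into upsampling, with the latent re-injected at every scale.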
Related papers
- Modeling Cross-vision Synergy for Unified Large Vision Model [130.37489011094036]
PolyV is a unified large vision model that achieves cross-vision synergy at both the architectural and training levels. PolyV consistently outperforms existing models, achieving an average improvement of over 10% over its backbone.
arXiv Detail & Related papers (2026-03-03T22:44:43Z)
- DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation [47.409626500688866]
We present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM.
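For reference, the reported PSNR is the standard peak signal-to-noise ratio over reconstruction error; a small self-contained sketch of that metric (illustrative, not the paper's evaluation code):

```python
# Illustrative only: standard PSNR between an image and its reconstruction,
# assuming pixel values in [0, 1]. Not the paper's evaluation code.
import torch

def psnr(x: torch.Tensor, x_hat: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    mse = torch.mean((x - x_hat) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

x = torch.rand(1, 3, 256, 256)
x_hat = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
print(psnr(x, x_hat))  # roughly 40 dB for additive noise of std 0.01
```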
arXiv Detail & Related papers (2026-01-30T12:25:34Z)
- AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors [6.6016630449883955]
AnomalyVFM is a framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism. It achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points.
arXiv Detail & Related papers (2026-01-28T12:02:58Z)
- Boosting Latent Diffusion Models via Disentangled Representation Alignment [23.13416934016185]
In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models as representation alignment targets for VAEs. We propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics.
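The mapper idea is concrete enough to sketch: a small non-linear network lifts VAE latent tokens into the VFM feature space, where an alignment objective can compare them against frozen VFM features. The dimensions and layer choices below are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (an assumption, not Send-VAE's code): a non-linear mapper
# that transforms VAE latent tokens into the VFM feature dimension.
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, vae_dim=16, vfm_dim=768, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vae_dim, hidden), nn.GELU(), nn.Linear(hidden, vfm_dim)
        )
    def forward(self, z):  # z: (B, N, vae_dim) flattened VAE latent tokens
        return self.net(z)

mapper = LatentMapper()
z = torch.randn(4, 256, 16)   # VAE latents, e.g. a 16x16 token grid
mapped = mapper(z)            # (4, 256, 768): now comparable to VFM features
print(mapped.shape)
```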
arXiv Detail & Related papers (2026-01-09T14:54:30Z)
- Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm [17.63632082331749]
Large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, but their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. We propose a Foundation-Driven Efficient Paradigm (FDEP) which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead.
arXiv Detail & Related papers (2025-12-05T08:12:35Z)
- Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency [60.74505433956616]
The continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. We propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer.
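The JVP (Jacobian-vector product) named above is the forward-mode differentiation primitive that continuous-time consistency training relies on. A toy sketch using PyTorch's stock torch.func.jvp follows; the paper's contribution is a FlashAttention-2-compatible kernel for this operation at 10B+ scale, which this sketch does not reproduce.

```python
# Toy illustration of a Jacobian-vector product (JVP), the primitive that
# continuous-time consistency training differentiates through. This uses
# PyTorch's built-in forward-mode API, not the paper's custom kernel.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 32), nn.SiLU(), nn.Linear(32, 8))

x = torch.randn(4, 8)   # primal input
v = torch.randn(4, 8)   # tangent direction

# Computes (net(x), J_net(x) @ v) in a single forward-mode pass.
out, jvp_out = torch.func.jvp(lambda inp: net(inp), (x,), (v,))
print(out.shape, jvp_out.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```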
arXiv Detail & Related papers (2025-10-09T16:45:30Z)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion [12.839049648094893]
Coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD). We propose a novel framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. The proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation.
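As a rough illustration of the parallel encoding idea in the title, here is a hedged sketch of a two-branch encoder (a CNN branch plus a ViT-style branch) with a simple concat-and-project fusion. The branch designs, dimensions, and fusion step are assumptions; the paper's variational fusion is not reproduced here.

```python
# Hedged sketch (an assumption based on the title, not the authors' code):
# parallel CNN and ViT-style encoders over the same image, fused by
# concatenation and a 1x1 projection.
import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # CNN branch: local texture features at 1/16 resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, 7, stride=16, padding=3), nn.SiLU(),
        )
        # ViT-style branch stand-in: patchify + one transformer encoder layer.
        self.patch = nn.Conv2d(1, dim, 16, stride=16)
        self.vit = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)  # concat-then-project fusion

    def forward(self, x):  # x: (B, 1, H, W) grayscale angiogram
        c = self.cnn(x)                               # (B, dim, H/16, W/16)
        t = self.patch(x).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        t = self.vit(t).transpose(1, 2).reshape_as(c) # back to (B, dim, h, w)
        return self.fuse(torch.cat([c, t], dim=1))

enc = ParallelEncoder()
feats = enc(torch.randn(2, 1, 256, 256))
print(feats.shape)  # torch.Size([2, 256, 16, 16])
```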
arXiv Detail & Related papers (2025-07-17T09:25:00Z)
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
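One plausible reading of a tokenizer built "atop" a VFM is nearest-neighbor quantization of frozen VFM patch features into a discrete vocabulary for AR modeling. The sketch below illustrates that generic pattern; it is an assumption, not VFMTok's actual design.

```python
# Generic sketch (an assumption, not VFMTok's design): discretize frozen VFM
# patch features against a learned codebook, yielding token ids that an
# autoregressive model can consume.
import torch
import torch.nn as nn

class VFMQuantizer(nn.Module):
    def __init__(self, num_codes=8192, dim=768):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats):  # feats: (B, N, dim) frozen VFM patch features
        # Pairwise distance between each feature and every codebook entry.
        d = torch.cdist(feats, self.codebook.weight)  # (B, N, num_codes)
        ids = d.argmin(dim=-1)                        # discrete token ids
        z_q = self.codebook(ids)                      # quantized features
        # Straight-through estimator so gradients flow past the argmin.
        z_q = feats + (z_q - feats).detach()
        return z_q, ids

quant = VFMQuantizer()
feats = torch.randn(2, 256, 768)  # e.g., 16x16 patches from a frozen VFM
z_q, ids = quant(feats)
print(ids.shape)  # torch.Size([2, 256]): token ids for AR generation
```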
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [70.1326027641056]
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. We propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions. We present a two-stage training pipeline, including supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-21T12:18:15Z)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
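A hedged sketch of what such latent-space alignment can look like as a training objective: a reconstruction term plus a term pulling projected latents toward frozen VFM features of the same image. The weighting, projection, and loss form below are assumptions, not VA-VAE's exact recipe.

```python
# Assumed form of an alignment-regularized tokenizer objective, not VA-VAE's
# exact loss: pixel reconstruction plus cosine alignment of projected latents
# to features from a frozen vision foundation model.
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_hat, z_proj, vfm_feats, lam=0.5):
    """x, x_hat: images; z_proj: latents already projected to the VFM
    feature dimension; vfm_feats: frozen VFM features for the same image."""
    recon = F.mse_loss(x_hat, x)
    align = (1 - F.cosine_similarity(z_proj, vfm_feats, dim=-1)).mean()
    return recon + lam * align

x = torch.rand(4, 3, 256, 256)
x_hat = torch.rand(4, 3, 256, 256, requires_grad=True)
z_proj = torch.randn(4, 256, 768, requires_grad=True)
vfm_feats = torch.randn(4, 256, 768)
tokenizer_loss(x, x_hat, z_proj, vfm_feats).backward()
```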
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
- Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GANs' inherent flaws and biased optimization within the latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.