DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
- URL: http://arxiv.org/abs/2601.22904v1
- Date: Fri, 30 Jan 2026 12:25:34 GMT
- Title: DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
- Authors: Hun Chang, Byunghee Cha, Jong Chul Ye
- Abstract summary: We present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM.
- Score: 47.409626500688866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
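The two ideas the abstract centers on can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: the function names (`cosine_alignment_loss`, `slerp`) and the plain-NumPy setting are assumptions made here for clarity; the paper presumably operates on DINO patch features inside a deep-learning framework. The first function matches only feature directions (leaving magnitudes free, as the abstract argues), and the second traces the geodesic on the unit hypersphere that a spherical flow-matching target would follow.

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Hypothetical sketch of a cosine-similarity alignment objective.

    Penalizes only directional mismatch between (N, D) feature arrays
    (e.g. an encoder vs. a frozen VFM), leaving magnitudes unconstrained.
    Returns the mean of (1 - cosine similarity) over the batch.
    """
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def slerp(x0, x1, t, eps=1e-8):
    """Spherical linear interpolation between points on the hypersphere.

    The geodesic path a Riemannian flow-matching target follows when the
    latent manifold is a sphere. Assumes x0 and x1 are not (anti)parallel.
    """
    x0 = x0 / (np.linalg.norm(x0, axis=-1, keepdims=True) + eps)
    x1 = x1 / (np.linalg.norm(x1, axis=-1, keepdims=True) + eps)
    cos_omega = np.clip(np.sum(x0 * x1, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos_omega)           # angle between the two points
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * x0 + (np.sin(t * omega) / so) * x1
```

Note that scaling either feature set leaves the alignment loss unchanged, which is exactly the "flexible magnitude" property the abstract claims preserves fine-grained detail, and that `slerp` outputs stay on the unit sphere for every `t`.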
Related papers
- Improving Reconstruction of Representation Autoencoder [52.817427902597416]
We propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information. Our experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction.
arXiv Detail & Related papers (2026-02-09T13:12:35Z)
- RecTok: Reconstruction Distillation along Rectified Flow [85.51292475005151]
We propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations. Our method distills the semantic information in VFMs into the forward flow trajectories of flow matching. RecTok achieves superior image reconstruction, generation quality, and discriminative performance.
arXiv Detail & Related papers (2025-12-15T15:14:20Z)
- Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a new state of the art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z)
- Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings [0.0]
We introduce a systematic framework for constructing neural decoder architectures for prominent NLDR methods. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Our findings reveal a fundamental trade-off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders.
arXiv Detail & Related papers (2025-10-15T14:50:51Z)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models [37.59115132356727]
We propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. On ImageNet 256×256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs. Our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
arXiv Detail & Related papers (2025-09-29T17:57:39Z)
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [52.261584726401686]
We present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation [41.909091496502704]
Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models. We propose TIDE, a Temporal-aware sparse autoencoder framework for Interpretable Diffusion transformErs.
arXiv Detail & Related papers (2025-03-10T08:35:51Z)
- UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z)
- VQ-NeRF: Vector Quantization Enhances Implicit Neural Representations [25.88881764546414]
VQ-NeRF is an efficient pipeline for enhancing implicit neural representations via vector quantization.
We present an innovative multi-scale NeRF sampling scheme that concurrently optimizes the NeRF model at both compressed and original scales.
We incorporate a semantic loss function to improve the geometric fidelity and semantic coherence of our 3D reconstructions.
arXiv Detail & Related papers (2023-10-23T01:41:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.