Towards Scalable Pre-training of Visual Tokenizers for Generation
- URL: http://arxiv.org/abs/2512.13687v1
- Date: Mon, 15 Dec 2025 18:59:54 GMT
- Title: Towards Scalable Pre-training of Visual Tokenizers for Generation
- Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
- Abstract summary: We present a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods.
- Score: 41.785568766118594
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) joint pre-training exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to pre-training the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying the standard DiT training recipe, solely investing more FLOPs in VTP pre-training achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early, at one tenth of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
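To make the joint objective concrete, below is a minimal PyTorch sketch of how the three training signals named in the abstract (image-text contrastive, self-supervised, and reconstruction) could be combined. The function name, loss weights, and the specific form of each term are illustrative assumptions, not VTP's actual implementation.

```python
import torch
import torch.nn.functional as F

def vtp_joint_loss(img_emb, txt_emb, student_feat, teacher_feat,
                   recon, target, w_con=1.0, w_ssl=1.0, w_rec=1.0,
                   temperature=0.07):
    """Illustrative joint objective: contrastive + self-supervised + reconstruction.

    img_emb, txt_emb: L2-normalized embeddings, shape (B, D).
    student_feat, teacher_feat: two views' features for a consistency-style SSL term.
    recon, target: decoder output and input pixels.
    """
    # (1) CLIP-style symmetric InfoNCE between image and text embeddings.
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # (2) A simple self-supervised consistency term against a detached teacher.
    loss_ssl = 1 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()

    # (3) Pixel reconstruction for the autoencoder branch.
    loss_rec = F.mse_loss(recon, target)

    return w_con * loss_con + w_ssl * loss_ssl + w_rec * loss_rec

# Toy usage with random tensors.
B, D = 8, 256
loss = vtp_joint_loss(
    F.normalize(torch.randn(B, D), dim=-1),
    F.normalize(torch.randn(B, D), dim=-1),
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, 3, 32, 32), torch.randn(B, 3, 32, 32))
```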
Related papers
- Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification [0.0]
This study evaluates the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M. Hyperparameter tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models deliver latency under 5 milliseconds and over 9,800 frames per second.
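As an illustration of the kind of sweep such a study runs, the toy PyTorch snippet below grid-searches learning rate and weight decay; the search space, tiny MLP, and random data are placeholders standing in for the paper's lightweight models and image benchmarks.

```python
import itertools
import torch
import torch.nn as nn

# Synthetic stand-in data keeps the sketch self-contained.
X = torch.randn(256, 32)
y = torch.randint(0, 4, (256,))

def train_and_eval(lr, weight_decay, epochs=20):
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    # Return training accuracy as a toy stand-in for validation top-1.
    return (model(X).argmax(dim=1) == y).float().mean().item()

grid = itertools.product([1e-3, 3e-3, 1e-2], [0.0, 1e-2])
best = max(grid, key=lambda hp: train_and_eval(*hp))
print("best (lr, weight_decay):", best)
```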
arXiv Detail & Related papers (2025-07-31T07:47:30Z) - Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction [0.0]
Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training.
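A hedged sketch of the second stage as described: a frozen, generatively pre-trained backbone feeding a small regression head that predicts a beauty score. The backbone stand-in, dimensions, and training loop below are illustrative only, not Diff-FBP's architecture.

```python
import torch
import torch.nn as nn

# Placeholder for the pre-trained diffusion-transformer feature extractor.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
for p in backbone.parameters():
    p.requires_grad = False  # keep the pre-trained features fixed

head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(16, 3, 64, 64)  # toy batch
scores = torch.rand(16, 1) * 4 + 1   # ratings in [1, 5], as in FBP5500

for _ in range(50):
    opt.zero_grad()
    pred = head(backbone(images))
    loss = nn.functional.mse_loss(pred, scores)
    loss.backward()
    opt.step()
```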
arXiv Detail & Related papers (2025-07-27T17:33:51Z) - Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning [8.284127681482202]
LVTP is a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two rounds of clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. As a plug-and-play module, it requires no architectural changes or additional training.
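The Tsallis entropy itself is standard: S_q(p) = (1 - sum_i p_i^q) / (q - 1), recovering Shannon entropy in the limit q -> 1. Below is a small PyTorch sketch that scores tokens with it; the selection rule (keep the highest-entropy tokens) is an illustrative guess, not LVTP's exact criterion.

```python
import torch

def tsallis_entropy(p, q=2.0, eps=1e-8):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), per row.

    p: probability distributions over the last dimension.
    """
    p = p.clamp_min(eps)
    return (1 - (p ** q).sum(dim=-1)) / (q - 1)

# Toy token scoring: rank tokens by the entropy of their attention rows
# and retain the top half (the keep rule here is illustrative).
attn = torch.softmax(torch.randn(1, 196, 196), dim=-1)  # (batch, tokens, tokens)
scores = tsallis_entropy(attn, q=2.0)                   # (1, 196)
keep = scores.topk(k=98, dim=-1).indices                # keep 50% of tokens
```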
arXiv Detail & Related papers (2025-04-25T00:43:20Z) - RLIPv2: Fast Scaling of Relational Language-Image Pre-training [53.21796397618875]
We propose RLIPv2, a fast converging model that enables the relational scaling of pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot, and zero-shot settings.
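RLIPv2's exact ALIF design is not spelled out in this summary; the snippet below is a generic gated cross-attention fusion block of the kind the description suggests, where language conditioning is injected into vision tokens through a learned gate. It is offered only as an illustration, not as RLIPv2's module.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Generic gated cross-attention: vision tokens attend to language
    tokens, and a learned gate (initialized at zero) controls how much
    fused signal is injected. Illustrative, not RLIPv2's exact ALIF."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # start with no fusion

    def forward(self, vis_tokens, lang_tokens):
        fused, _ = self.attn(vis_tokens, lang_tokens, lang_tokens)
        return vis_tokens + torch.tanh(self.gate) * fused

block = GatedCrossModalFusion(dim=256)
out = block(torch.randn(2, 100, 256), torch.randn(2, 20, 256))
```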
arXiv Detail & Related papers (2023-08-18T07:17:09Z) - Pre-Pruning and Gradient-Dropping Improve Differentially Private Image Classification [9.120531252536617]
We introduce a new training paradigm that uses pre-pruning and gradient-dropping to reduce the parameter space and improve scalability.
Our training paradigm introduces a tension between the rates of pre-pruning and gradient-dropping, privacy loss, and classification accuracy.
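A minimal sketch of the two ideas as named: magnitude-based pre-pruning before training and per-step gradient-dropping. Thresholds and rates are illustrative, and the differentially private machinery (per-example clipping, noise addition) is omitted.

```python
import torch
import torch.nn as nn

def pre_prune(model, sparsity=0.5):
    """Zero the smallest-magnitude weights before (private) training starts."""
    for p in model.parameters():
        if p.dim() > 1:
            k = int(p.numel() * sparsity)
            if k > 0:
                thresh = p.abs().flatten().kthvalue(k).values
                p.data[p.abs() <= thresh] = 0.0

def drop_gradients(model, drop_rate=0.5):
    """Zero the smallest-magnitude gradient entries each step, shrinking
    the effective parameter space the DP mechanism must cover."""
    for p in model.parameters():
        if p.grad is not None:
            k = int(p.grad.numel() * drop_rate)
            if k > 0:
                thresh = p.grad.abs().flatten().kthvalue(k).values
                p.grad[p.grad.abs() <= thresh] = 0.0

model = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 10))
pre_prune(model, sparsity=0.5)
model(torch.randn(4, 100)).sum().backward()
drop_gradients(model, drop_rate=0.5)  # then clip, add noise, and step in DP-SGD
```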
arXiv Detail & Related papers (2023-06-19T14:35:28Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
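A sketch of unstructured magnitude sparsity maintained during pre-training, with the mask simply dropped for dense fine-tuning. The sparsity pattern and re-masking schedule below are illustrative assumptions, not SPDF's exact recipe.

```python
import torch
import torch.nn as nn

def magnitude_mask(weight, sparsity=0.75):
    """Boolean mask keeping the largest-magnitude (1 - sparsity) fraction."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    thresh = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > thresh

layer = nn.Linear(256, 256)
mask = magnitude_mask(layer.weight.data, sparsity=0.75)
layer.weight.data *= mask  # sparse pre-training starts from pruned weights

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
for _ in range(3):  # sparse pre-training steps
    opt.zero_grad()
    layer(torch.randn(8, 256)).pow(2).mean().backward()
    opt.step()
    layer.weight.data *= mask  # re-apply mask so pruned weights stay zero
# Dense fine-tuning: continue training without re-applying the mask.
```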
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histogram of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
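A sketch of building HOG regression targets for masked patches using scikit-image's hog; the patch size, mask ratio, and HOG configuration are assumptions, not FastMIM's settings.

```python
import numpy as np
from skimage.feature import hog

def hog_target(image, mask, patch=8):
    """Build a FastMIM-style regression target: per-patch HOG histograms,
    supervised only at masked positions.

    image: (H, W) grayscale array; mask: boolean (H//patch, W//patch)."""
    feats = hog(image, orientations=9, pixels_per_cell=(patch, patch),
                cells_per_block=(1, 1), feature_vector=False)
    # feats: (H//patch, W//patch, 1, 1, 9) -> one 9-bin histogram per patch.
    feats = feats.reshape(mask.shape[0], mask.shape[1], 9)
    return feats[mask]  # targets for the masked patches only

img = np.random.rand(64, 64)
patch_mask = np.random.rand(8, 8) < 0.75  # high mask ratio, e.g. 75%
targets = hog_target(img, patch_mask)     # (n_masked, 9)
```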
arXiv Detail & Related papers (2022-12-13T14:09:32Z) - Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones [0.0]
We propose an off-the-shelf baseline for continual learning on computer vision problems.
We exploit the power of pretrained models to compute a class prototype and fill a memory bank.
We compare our pipeline with common CNN models and show the superiority of Vision Transformers.
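The prototype-plus-memory-bank recipe is simple enough to sketch directly: a frozen pretrained backbone extracts features, each class is stored as its mean feature, and prediction is nearest prototype. Feature dimensions and the distance rule below are illustrative; the paper's pipeline details may differ.

```python
import torch

class PrototypeClassifier:
    """Off-the-shelf continual learner: each class is represented by the
    mean feature (prototype) of its examples in a memory bank."""

    def __init__(self):
        self.prototypes = {}  # class id -> mean feature vector

    def add_class(self, class_id, features):
        # features: (n_examples, d), extracted by a frozen pretrained backbone.
        self.prototypes[class_id] = features.mean(dim=0)

    def predict(self, feature):
        ids = list(self.prototypes)
        protos = torch.stack([self.prototypes[i] for i in ids])
        dists = (protos - feature).norm(dim=1)
        return ids[dists.argmin().item()]

clf = PrototypeClassifier()
clf.add_class(0, torch.randn(50, 128))
clf.add_class(1, torch.randn(50, 128) + 1.0)
print(clf.predict(torch.randn(128) + 1.0))  # likely class 1
```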
arXiv Detail & Related papers (2022-05-03T16:03:46Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves strong performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
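One simple way to probe the "sparser active neurons" observation is to count the fraction of positive post-activation units per layer, as in the hedged sketch below. A toy MLP stands in for a ViT, and the paper's exact measurement protocol may differ.

```python
import torch
import torch.nn as nn

# Probe: fraction of active (positive) units after each activation layer.
model = nn.Sequential(nn.Linear(64, 128), nn.GELU(),
                      nn.Linear(128, 128), nn.GELU())

activations = []
for layer in model:
    if isinstance(layer, nn.GELU):
        layer.register_forward_hook(
            lambda m, inp, out: activations.append(out.detach()))

_ = model(torch.randn(32, 64))
for i, a in enumerate(activations):
    print(f"activation {i}: {(a > 0).float().mean().item():.3f} active")
```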
arXiv Detail & Related papers (2021-06-03T02:08:03Z) - Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
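HPT's core move, starting a new round of self-supervised pretraining from an existing checkpoint instead of random weights, can be sketched in a few lines. The supervised ImageNet weights and the toy consistency loss below are stand-ins; a real SSL objective needs stop-gradients or negatives to avoid representational collapse.

```python
import torch
import torch.nn.functional as F
import torchvision

# Initialize from an existing checkpoint (a supervised ImageNet model
# stands in here for the pretrained base model HPT would use).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # keep only the feature trunk
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 3, 224, 224)
view1 = x + 0.1 * torch.randn_like(x)  # toy "augmented" views
view2 = x + 0.1 * torch.randn_like(x)

# One toy self-supervised step: pull features of the two views together.
z1, z2 = model(view1), model(view2)
loss = 1 - F.cosine_similarity(z1, z2, dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```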
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.