Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- URL: http://arxiv.org/abs/2501.09755v1
- Date: Thu, 16 Jan 2025 18:59:04 GMT
- Title: Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
- Abstract summary: Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space.
Our work explores scaling in auto-encoders to fill this gap.
We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling.
- Score: 30.942443676393584
- Abstract: Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation, and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction, but its benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
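For intuition, here is a minimal sketch of such a ViT-based auto-encoder in PyTorch: patchify, encode, squeeze through a narrow latent bottleneck, then decode back to pixels. The class name, layer counts, and widths are illustrative assumptions rather than ViTok's published configuration; the decoder is deliberately deeper than the encoder to echo the finding that decoder capacity matters more for reconstruction.

```python
# Illustrative ViT-style auto-encoder with a latent bottleneck.
# Sizes, depths, and names are assumptions, not ViTok's exact config.
import torch
import torch.nn as nn

class ViTAutoEncoder(nn.Module):
    def __init__(self, image_size=256, patch=16, width=768,
                 enc_depth=6, dec_depth=12, latent_dim=16):
        super().__init__()
        self.patch = patch
        n_tokens = (image_size // patch) ** 2
        self.to_tokens = nn.Conv2d(3, width, patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, width))
        make_layer = lambda: nn.TransformerEncoderLayer(
            width, nhead=12, dim_feedforward=4 * width, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), enc_depth)
        # Bottleneck: project each token down to a small channel dim.
        self.down = nn.Linear(width, latent_dim)
        self.up = nn.Linear(latent_dim, width)
        # Decoder is deeper than the encoder (see the scaling findings).
        self.decoder = nn.TransformerEncoder(make_layer(), dec_depth)
        self.to_pixels = nn.Linear(width, 3 * patch * patch)  # unpatchify

    def forward(self, x):
        b, _, h, w = x.shape
        tok = self.to_tokens(x).flatten(2).transpose(1, 2) + self.pos
        z = self.down(self.encoder(tok))        # compact latent tokens
        out = self.to_pixels(self.decoder(self.up(z)))
        out = out.view(b, h // self.patch, w // self.patch,
                       3, self.patch, self.patch)
        return z, out.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, h, w)

z, recon = ViTAutoEncoder()(torch.randn(2, 3, 256, 256))
print(z.shape, recon.shape)  # (2, 256, 16) latents, (2, 3, 256, 256) pixels
```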
Related papers
- Improving the Diffusability of Autoencoders [54.920783089085035]
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos.
We perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces.
We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality.
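As a rough illustration of that kind of spectral check (not the paper's exact protocol), one can measure how much of a latent's energy sits above a radial frequency cutoff:

```python
# Fraction of latent energy in high spatial frequencies; the cutoff
# and the measurement are illustrative, not the paper's exact analysis.
import torch

def high_freq_energy_ratio(latents, cutoff=0.5):
    """latents: (B, C, H, W) output of an auto-encoder's encoder."""
    spec = torch.fft.fftshift(torch.fft.fft2(latents), dim=(-2, -1))
    power = spec.abs() ** 2
    h, w = latents.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    radius = (yy ** 2 + xx ** 2).sqrt()      # normalized radial frequency
    high = power[..., radius > cutoff].sum()
    return (high / power.sum()).item()

z = torch.randn(4, 16, 32, 32)               # stand-in latents
print(high_freq_energy_ratio(z))             # white noise: about the mask's area share
```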
arXiv Detail & Related papers (2025-02-20T18:45:44Z)
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
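A toy sampler in that spirit is sketched below; `eps_model`, the schedule, and the step count are illustrative assumptions rather than ε-VAE's actual recipe.

```python
# Toy "denoising as decoding": the decoder is a sampler that iteratively
# refines noise, conditioned on the encoder's latents. All hyper-parameters
# here are illustrative assumptions.
import torch

@torch.no_grad()
def diffusion_decode(eps_model, z, shape, steps=50):
    """eps_model(x_t, t, z) predicts the noise in x_t given latents z."""
    x = torch.randn(shape)                       # start from pure noise
    alphas = torch.linspace(0.999, 0.95, steps)  # toy per-step schedule
    alpha_bar = torch.cumprod(alphas, 0)
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i)
        eps = eps_model(x, t, z)                 # latent-conditioned denoiser
        # DDPM-style posterior mean; the variance choice is simplified.
        x = (x - (1 - alphas[i]) / (1 - alpha_bar[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            x = x + (1 - alphas[i]).sqrt() * torch.randn_like(x)
    return x  # reconstructed image
```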
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
- Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [74.15447383432262]
The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer.
We provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks.
We produce a family of auto-regressive image generation models ranging from 300M to 1.5B.
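The MAGVIT-v2 tokenizer replicated here is built around lookup-free quantization (LFQ), which binarizes each latent channel instead of matching against a learned codebook; a minimal sketch with an illustrative latent width:

```python
# Minimal lookup-free quantization (LFQ): each latent channel becomes a
# sign bit, and the K bits index one of 2**K implicit codes, so no
# codebook lookup is needed. The latent width below is illustrative.
import torch

def lfq(z):
    """z: (..., K) continuous latents -> quantized latents and token ids."""
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    bits = (q > 0).long()
    weights = 2 ** torch.arange(z.shape[-1])   # bit position -> value
    tokens = (bits * weights).sum(-1)          # integer token id
    q = z + (q - z).detach()                   # straight-through gradient
    return q, tokens

z = torch.randn(2, 16, 16, 10)                 # K=10 -> 1024 possible codes
q, tokens = lfq(z)
print(q.shape, tokens.shape, int(tokens.max()))
```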
arXiv Detail & Related papers (2024-09-06T17:14:53Z)
- An Efficient Implicit Neural Representation Image Codec Based on Mixed Autoregressive Model for Low-Complexity Decoding [43.43996899487615]
Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models.
We introduce a new Mixed AutoRegressive Model (MARM) to substantially reduce decoding time for current INR codecs.
MARM includes our proposed AutoRegressive Upsampler (ARU) blocks, which are highly efficient, and ARM from previous work to balance decoding time and reconstruction quality.
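For readers new to INR codecs, the core idea is that an image is stored as the weights of a small network mapping coordinates to colors, so decoding is a forward pass over the coordinate grid. A toy fitting loop follows; the architecture and hyper-parameters are illustrative, and MARM's ARU/ARM blocks are not modeled here:

```python
# Toy INR: overfit a small MLP mapping (x, y) coordinates to RGB values.
# The "bitstream" is the MLP's weights. Sizes and steps are illustrative.
import torch
import torch.nn as nn

def fit_inr(image, steps=1000, lr=1e-3):
    """image: (3, H, W) tensor in [0, 1]. Returns the fitted coordinate MLP."""
    _, h, w = image.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xx, yy], -1).reshape(-1, 2)
    target = image.permute(1, 2, 0).reshape(-1, 3)
    mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 3), nn.Sigmoid())
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((mlp(coords) - target) ** 2).mean()  # per-pixel MSE
        loss.backward()
        opt.step()
    return mlp  # decoding = mlp(coords).reshape(h, w, 3)
```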
arXiv Detail & Related papers (2024-01-23T09:37:58Z)
- ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression [18.05997169440533]
We propose ConvNeXt-ChARM, an efficient ConvNeXt-based transform coding framework, paired with a compute-efficient channel-wise auto-regressive entropy model (ChARM; sketched below).
We show that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions, averaging 5.24% over the versatile video coding (VVC) reference encoder (VTM-18.0) and 1.22% over the state-of-the-art learned image compression method SwinT-ChARM.
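A schematic of the ChARM idea, assuming a hyperprior supplies side information; slice counts and widths are illustrative:

```python
# Schematic channel-wise autoregressive entropy model: latent channels
# are split into slices, and each slice's Gaussian parameters are
# predicted from side information plus the slices already decoded.
# Slice count, widths, and the 3x3 parameter nets are assumptions.
import torch
import torch.nn as nn

class ChannelARM(nn.Module):
    def __init__(self, channels=192, n_slices=4, hyper_ch=64):
        super().__init__()
        s = channels // n_slices
        self.n = n_slices
        self.nets = nn.ModuleList([
            nn.Conv2d(hyper_ch + i * s, 2 * s, 3, padding=1)
            for i in range(n_slices)])

    def forward(self, y, hyper):
        """y: (B, C, H, W) latent; hyper: (B, hyper_ch, H, W) side info."""
        slices, decoded, params = y.chunk(self.n, 1), [], []
        for i, net in enumerate(self.nets):
            ctx = torch.cat([hyper, *decoded], dim=1)
            mean, scale = net(ctx).chunk(2, dim=1)
            params.append((mean, scale))       # entropy-code slice i here
            decoded.append(slices[i])
        return params

params = ChannelARM()(torch.randn(1, 192, 16, 16), torch.randn(1, 64, 16, 16))
print(len(params), params[0][0].shape)         # 4 slices, (1, 48, 16, 16)
```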
arXiv Detail & Related papers (2023-07-12T11:45:54Z)
- VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z)
- ALAP-AE: As-Lite-as-Possible Auto-Encoder [6.244939945140818]
We present a novel algorithm to reduce tensor compute required by a conditional image generation autoencoder.
We show performance gains for various conditional image generation tasks.
We achieve real-time versions of various autoencoders on CPU-only devices while maintaining image quality.
arXiv Detail & Related papers (2022-03-19T18:03:08Z)
- Transformer-based Image Compression [18.976159633970177]
A Transformer-based Image Compression (TIC) approach is developed that reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders (see the sketch below).
TIC rivals state-of-the-art approaches, including deep convolutional neural network (CNN) based learned image coding (LIC) methods and the handcrafted, rules-based intra profile of the recently approved Versatile Video Coding (VVC) standard.
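The paired main/hyper layout is the standard hyperprior structure: a main transform produces the latent, while a smaller hyper transform compresses side information that parameterizes the latent's entropy model. The convolutional schematic below shows the wiring only; TIC itself uses Transformer blocks inside these transforms, and the channel counts are illustrative:

```python
# Wiring of a hyperprior codec: main transforms g_a/g_s plus hyper
# transforms h_a/h_s. TIC replaces these conv blocks with Transformer
# layers; channel counts and strides here are illustrative.
import torch
import torch.nn as nn

class Hyperprior(nn.Module):
    def __init__(self, n=128, m=192):
        super().__init__()
        self.g_a = nn.Sequential(  # main encoder: image -> latent y
            nn.Conv2d(3, n, 5, 2, 2), nn.GELU(),
            nn.Conv2d(n, m, 5, 2, 2))
        self.g_s = nn.Sequential(  # main decoder: y -> reconstruction
            nn.ConvTranspose2d(m, n, 5, 2, 2, output_padding=1), nn.GELU(),
            nn.ConvTranspose2d(n, 3, 5, 2, 2, output_padding=1))
        self.h_a = nn.Sequential(  # hyper encoder: y -> side info z
            nn.Conv2d(m, n, 3, 2, 1), nn.GELU(),
            nn.Conv2d(n, n, 3, 2, 1))
        self.h_s = nn.Sequential(  # hyper decoder: z -> entropy params for y
            nn.ConvTranspose2d(n, n, 3, 2, 1, output_padding=1), nn.GELU(),
            nn.ConvTranspose2d(n, 2 * m, 3, 2, 1, output_padding=1))

    def forward(self, x):
        y = self.g_a(x)
        z = self.h_a(y)                            # quantization omitted
        mean, scale = self.h_s(z).chunk(2, dim=1)  # Gaussian params for y
        return self.g_s(y), (y, mean, scale)

recon, (y, mean, scale) = Hyperprior()(torch.randn(1, 3, 64, 64))
print(recon.shape, y.shape, mean.shape)
```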
arXiv Detail & Related papers (2021-11-12T13:13:20Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to form a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
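A minimal rendering of that encoder-decoder split; depth, width, and the naive upsampling decoder are illustrative rather than SETR's exact variants:

```python
# Minimal SETR-style segmenter: a pure transformer encodes patch tokens,
# and a simple head reshapes and upsamples them to per-pixel classes.
# Depth, width, and the naive decoder are illustrative; positional
# embeddings are omitted for brevity.
import torch
import torch.nn as nn

class TinySETR(nn.Module):
    def __init__(self, n_classes=19, patch=16, width=256, depth=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, width, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(width, nhead=8,
                                           dim_feedforward=4 * width,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(width, n_classes, 1)  # "simple decoder"

    def forward(self, x):
        b, _, h, w = x.shape
        tok = self.embed(x).flatten(2).transpose(1, 2)  # patch sequence
        tok = self.encoder(tok)                         # global context everywhere
        feat = tok.transpose(1, 2).reshape(b, -1, h // self.patch,
                                           w // self.patch)
        return nn.functional.interpolate(self.head(feat), (h, w),
                                         mode="bilinear", align_corners=False)

print(TinySETR()(torch.randn(1, 3, 128, 128)).shape)  # (1, 19, 128, 128)
```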
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machines (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames under the guidance of learned motion patterns.
By learning to extract sparse motion patterns via a predictive model, the network leverages the feature representation to generate the appearance of the frames to be coded.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)