Related papers: Masked Generative Nested Transformers with Decode Time Scaling

URL: http://arxiv.org/abs/2502.00382v1
Date: Sat, 01 Feb 2025 09:41:01 GMT
Title: Masked Generative Nested Transformers with Decode Time Scaling
Authors: Sahil Goyal, Debapriya Tula, Gagan Jain, Pradeep Shenoy, Prateek Jain, Sujoy Paul
Abstract summary: In this work, we aim to address the bottleneck of inference computational efficiency in visual generation algorithms. We design a decode time model scaling schedule to utilize compute effectively, and we can cache and reuse some of the computation. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
Score: 21.34984197218021
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet $256\times256$, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
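Illustration: the two ideas in the abstract (a decode time model-size schedule, plus caching so that larger models touch fewer tokens) can be pictured with a rough compute-accounting sketch. This is not the authors' implementation. It assumes, as a simplification not stated in the paper, that the cost of one decoding step is proportional to the fraction of the full model used times the number of tokens processed. The function name `relative_compute` and the four-step schedule are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def relative_compute(model_fractions: Sequence[float],
                     tokens_processed: Sequence[int],
                     total_tokens: int) -> float:
    """Cost of a schedule relative to a full-model, all-token baseline.

    Assumption (simplified cost model): each decoding step costs
    model_fraction * tokens_processed, and the baseline runs the full
    model (fraction 1.0) on all tokens at every step.
    """
    if len(model_fractions) != len(tokens_processed):
        raise ValueError("need one model fraction and one token count per step")
    steps = len(model_fractions)
    baseline = steps * 1.0 * total_tokens
    scheduled = sum(f * t for f, t in zip(model_fractions, tokens_processed))
    return scheduled / baseline


# Hypothetical schedule: small nested models see many tokens early on,
# the full model sees only the few tokens that are not served from cache.
fractions = [0.25, 0.25, 0.5, 1.0]
tokens = [256, 256, 128, 64]

ratio = relative_compute(fractions, tokens, total_tokens=256)
print(f"{ratio:.3f}")  # prints 0.250 for these illustrative numbers
```

The numbers above are made up and do not reproduce the paper's reported savings; they only show how pairing smaller models with more tokens and larger models with fewer tokens lowers total compute under this cost model.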

Related papers

TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training [23.54555663670558]
This work aims to improve the training efficiency of the diffusion backbone by using predefined routes that store this information until it is reintroduced to deeper layers of the model. Unlike most current approaches, TREAD achieves this without architectural modifications. We show that our method reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-1K 256 x 256 in class-conditional synthesis.
arXiv Detail & Related papers (2025-01-08T18:38:25Z)
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting. We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. We evaluate the generation of various objects and show significant improvements in accuracy.
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
FrameQuant: Flexible Low-Bit Quantization for Transformers [25.569106620123346]
Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower. We show, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
arXiv Detail & Related papers (2024-03-10T04:01:49Z)
PaSS: Parallel Speculative Sampling [29.23180061749074]
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
arXiv Detail & Related papers (2023-11-22T18:37:27Z)
Decreasing the Computing Time of Bayesian Optimization using Generalizable Memory Pruning [56.334116591082896]
We show a wrapper of memory pruning and bounded optimization capable of being used with any surrogate model and acquisition function. Running BO on high-dimensional or massive data sets becomes intractable due to this time complexity. All model implementations are run on the MIT Supercloud state-of-the-art computing hardware.
arXiv Detail & Related papers (2023-09-08T14:05:56Z)
A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies [0.0]
We compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation. We then demonstrate that long context length does yield better performance, albeit application-dependent. Inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies.
arXiv Detail & Related papers (2023-02-13T09:47:31Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it. Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
CNNs for JPEGs: A Study in Computational Cost [49.97673761305336]
Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade. CNNs are capable of learning robust representations of the data directly from the RGB pixels. Deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years.
arXiv Detail & Related papers (2020-12-26T15:00:10Z)
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.