Related papers: Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

URL: http://arxiv.org/abs/2405.13218v2
Date: Fri, 24 May 2024 13:58:09 GMT
Title: Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction
Authors: Maciej Kilian, Varun Jampani, Luke Zettlemoyer
Abstract summary: Diffusion, masked-token prediction, and next-token prediction all use a Transformer network architecture. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following.
Score: 79.78050867137594
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput is more important.

Related papers

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models [15.293203074854267]
We present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. Experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
arXiv Detail & Related papers (2024-12-08T02:23:40Z)
Channel-aware Contrastive Conditional Diffusion for Multivariate Probabilistic Time Series Forecasting [19.383395337330082]
We propose a generic channel-aware Contrastive Conditional Diffusion model entitled CCDM. The proposed CCDM can exhibit superior forecasting capability compared to current state-of-the-art diffusion forecasters.
arXiv Detail & Related papers (2024-10-03T03:13:15Z)
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction [29.834614425056355]
We introduce Lotus, a diffusion-based visual foundation model with a simple yet effective adaptation protocol for dense prediction. Specifically, Lotus is trained to directly predict annotations instead of noise, thereby avoiding harmful variance. We introduce a novel tuning strategy called detail preserver, which achieves more accurate and fine-grained predictions.
arXiv Detail & Related papers (2024-09-26T17:58:55Z)
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z)
TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. Alongside TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
Recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate. We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks. Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z)
Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters. We show that it achieves approximate valid population coverage and near-optimal efficiency within class. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective on graph contrastive learning methods, showing that random augmentations lead to stochastic encoders. Our proposed method represents each node by a distribution in the latent space, in contrast to existing techniques, which embed each node as a deterministic vector. We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z)
Aligned Contrastive Predictive Coding [10.521845940927163]
We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations. Rather than producing individual predictions for each of the future representations, the model emits a sequence of predictions shorter than that of the upcoming representations to which they will be aligned.
arXiv Detail & Related papers (2021-04-24T13:07:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.