Scalable Adaptive Computation for Iterative Generation
- URL: http://arxiv.org/abs/2212.11972v2
- Date: Wed, 14 Jun 2023 03:32:57 GMT
- Title: Scalable Adaptive Computation for Iterative Generation
- Authors: Allan Jabri, David Fleet, Ting Chen
- Abstract summary: Recurrent Interface Networks (RINs) are an attention-based architecture that decouples its core computation from the dimensionality of the data.
RINs focus the bulk of computation on a set of latent tokens, using cross-attention to read and write information between latent and data tokens.
RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance.
- Score: 13.339848496653465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural data is redundant, yet predominant architectures tile computation
uniformly across their input and output space. We propose the Recurrent
Interface Networks (RINs), an attention-based architecture that decouples its
core computation from the dimensionality of the data, enabling adaptive
computation for more scalable generation of high-dimensional data. RINs focus
the bulk of computation (i.e. global self-attention) on a set of latent tokens,
using cross-attention to read and write (i.e. route) information between latent
and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and
top-down (latent to data) feedback, leading to deeper and more expressive
routing. While this routing introduces challenges, this is less problematic in
recurrent computation settings where the task (and routing problem) changes
gradually, such as iterative generation with diffusion models. We show how to
leverage recurrence by conditioning the latent tokens at each forward pass of
the reverse diffusion process with those from prior computation, i.e. latent
self-conditioning. RINs yield state-of-the-art pixel diffusion models for image
and video generation, scaling to 1024×1024 images without cascades or guidance,
while being domain-agnostic and up to 10× more efficient than 2D and 3D U-Nets.
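The read/compute/write routing described in the abstract can be sketched in a few lines. This is a minimal, illustrative numpy version, not the authors' implementation: it uses single-head attention with no learned projections, layer norms, or MLPs, and the token counts and widths are arbitrary. The loop at the end mimics latent self-conditioning by warm-starting each forward pass with the latents from the previous one.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # scaled dot-product attention: (Lq, d), (Lk, d), (Lk, d) -> (Lq, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def rin_block(latents, data):
    # read: latents pull information from the (many) data tokens
    latents = latents + attend(latents, data, data)
    # compute: the bulk of the work is self-attention over the few latents
    latents = latents + attend(latents, latents, latents)
    # write: data tokens pull updated information back from the latents
    data = data + attend(data, latents, latents)
    return latents, data

rng = np.random.default_rng(0)
data = rng.normal(size=(256, 32))    # 256 data tokens, width 32
latents = rng.normal(size=(16, 32))  # only 16 latent tokens

# latent self-conditioning across (mock) reverse-diffusion steps:
# each pass is initialized with the latents from the prior pass
for _ in range(3):
    latents, data = rin_block(latents, data)

print(latents.shape, data.shape)  # (16, 32) (256, 32)
```

Because global self-attention runs only over the 16 latents, its cost is decoupled from the 256 data tokens; the cross-attention reads and writes scale linearly in the data length.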
Related papers
- Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
- AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation [51.143540967290114]
We propose a method that unlocks a wide range of previously-infeasible geometric augmentations for unsupervised depth computation and estimation.
This is achieved by reversing, or "undo"-ing, geometric transformations to the coordinates of the output depth, warping the depth map back to the original reference frame.
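The undo idea can be illustrated with the simplest geometric transform, a horizontal flip; the paper covers a much broader family of warps, and `predict_depth` here is a hypothetical stand-in for a real depth network. Because the augmentation is inverted on the output, the loss is always computed in the original reference frame.

```python
import numpy as np

def augment(image):
    # stand-in geometric augmentation: horizontal flip
    return image[:, ::-1]

def undo(depth):
    # invert the same transform on the predicted depth, warping it
    # back to the original reference frame before computing the loss
    return depth[:, ::-1]

# hypothetical per-pixel depth predictor (pointwise, so the undo
# recovers the unaugmented prediction exactly in this toy case)
predict_depth = lambda img: 1.0 / (img + 2.0)

img = np.arange(12.0).reshape(3, 4)
pred = undo(predict_depth(augment(img)))
assert np.allclose(pred, predict_depth(img))
```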
arXiv Detail & Related papers (2023-10-15T05:15:45Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- HyperTime: Implicit Neural Representation for Time Series [131.57172578210256]
Implicit neural representations (INRs) have recently emerged as a powerful tool that provides an accurate and resolution-independent encoding of data.
In this paper, we analyze the representation of time series using INRs, comparing different activation functions in terms of reconstruction accuracy and training convergence speed.
We propose a hypernetwork architecture that leverages INRs to learn a compressed latent representation of an entire time series dataset.
arXiv Detail & Related papers (2022-08-11T14:05:51Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work we incorporate transformers into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
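The JPEG-inspired idea can be sketched as follows; this is a loose illustration of the compression principle, not the authors' exact formulation. Keys and values are projected onto the first few rows of an orthonormal DCT-II basis along the sequence axis, discarding high-frequency components, so attention costs O(L·m) rather than O(L²).

```python
import numpy as np

def dct_basis(n, m):
    # first m rows of an orthonormal DCT-II basis (compress length n -> m)
    k = np.arange(m)[:, None]
    i = np.arange(n)[None, :]
    B = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    B[0] /= np.sqrt(2.0)
    return B  # shape (m, n)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dct_attention(q, k, v, m):
    # drop high-frequency components of K and V along the sequence
    # axis, in the spirit of JPEG-style lossy compression
    B = dct_basis(k.shape[0], m)
    k_c, v_c = B @ k, B @ v               # (m, d) each
    scores = q @ k_c.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_c          # (L, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(128, 16))
k = rng.normal(size=(128, 16))
v = rng.normal(size=(128, 16))
out = dct_attention(q, k, v, m=32)  # attends to 32 coefficients, not 128 keys
print(out.shape)  # (128, 16)
```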
arXiv Detail & Related papers (2022-03-02T15:25:27Z)
- Deep Neural Networks are Surprisingly Reversible: A Baseline for Zero-Shot Inversion [90.65667807498086]
This paper presents a zero-shot direct model inversion framework that recovers the input to the trained model given only the internal representation.
We empirically show that modern classification models on ImageNet can, surprisingly, be inverted, allowing an approximate recovery of the original 224x224px images from a representation after more than 20 layers.
arXiv Detail & Related papers (2021-07-13T18:01:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.