Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
- URL: http://arxiv.org/abs/2312.14385v2
- Date: Mon, 6 May 2024 03:54:58 GMT
- Title: Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
- Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
- Abstract summary: We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models.
Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models.
We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models, which resemble the Decode phase.
- Score: 12.827526286642282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into two categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models that resemble the Decode phase. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe that sequence length can vary up to 4x in Diffusion model inference. We additionally observe that temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.
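The operator-level time breakdown described in the abstract can be approximated at toy scale with a standard profiler. The sketch below is a minimal illustration, not the authors' benchmarking harness: it profiles a hypothetical UNet-style denoising block (layer sizes, latent resolution, and step count are all made-up stand-ins) with torch.profiler to see how time splits across Convolution, Attention, and Linear kernels. The comments also note how spatial sequence length (H*W) changes across resolutions, the kind of variation the abstract quantifies as up to 4x.

```python
# Illustrative sketch (not the paper's benchmark harness): profile how time
# splits across Convolution, Attention, and Linear kernels in a toy
# UNet-style denoising block. All shapes and layer sizes are hypothetical.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


class ToyDenoiserBlock(nn.Module):
    """Conv -> spatial self-attention -> Linear, mimicking one UNet stage."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=8,
                                          batch_first=True)
        self.proj = nn.Linear(channels, channels)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = self.conv_in(x)
        # Sequence length for spatial attention is H*W: a 64x64 latent yields
        # 4096 tokens, while a 2x-downsampled 32x32 stage yields 1024 -- the
        # kind of 4x sequence-length variation the abstract describes.
        seq = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        seq, _ = self.attn(seq, seq, seq)
        seq = self.proj(seq)
        x = seq.transpose(1, 2).reshape(b, c, h, w)
        return self.conv_out(x)


def main() -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ToyDenoiserBlock().to(device).eval()
    # Hypothetical 32x32 latent to keep the demo lightweight.
    latent = torch.randn(1, 128, 32, 32, device=device)

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with torch.no_grad(), profile(activities=activities) as prof:
        for _ in range(8):  # stand-in for a handful of denoising steps
            model(latent)

    # Operator-level table: look for aten::convolution, attention kernels,
    # and aten::linear / aten::addmm rows to estimate the time split.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))


if __name__ == "__main__":
    main()
```

Summing the aten::convolution, attention, and aten::linear/aten::addmm rows of the printed table gives a rough Convolution/Attention/Linear split that is analogous, at toy scale, to the breakdown the paper reports for production TTI/TTV models, and rerunning at different latent resolutions shows how the effective sequence length changes.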
Related papers
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature.
We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z) - ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design [79.7289790249621]
Our proposed method, T2V-Turbo-v2, advances video generation post-training by integrating multiple supervision signals, spanning training data, reward feedback, and conditional guidance.
We highlight the crucial importance of tailoring datasets to specific learning objectives.
We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver.
arXiv Detail & Related papers (2024-10-08T04:30:06Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.