MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
- URL: http://arxiv.org/abs/2410.02130v2
- Date: Thu, 13 Feb 2025 08:24:37 GMT
- Title: MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation
- Authors: Trung X. Pham, Tri Ton, Chang D. Yoo
- Abstract summary: We introduce MDSGen, a novel framework for vision-guided open-domain sound generation. MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves $97.9$% alignment accuracy. Our larger model (131M parameters) reaches nearly $99$% accuracy while requiring $6.5\times$ fewer parameters.
- Score: 21.242398582282522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves $97.9$% alignment accuracy, using $172\times$ fewer parameters, $371$% less memory, and offering $36\times$ faster inference than the current 860M-parameter state-of-the-art model ($93.9$% accuracy). The larger model (131M parameters) reaches nearly $99$% accuracy while requiring $6.5\times$ fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
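To make the second innovation concrete, below is a minimal sketch of what a temporal-aware masking strategy over latent audio tokens could look like. It is not the authors' code: the function name, tensor layout (a [batch, time, freq] token grid), and mask ratio are assumptions made for illustration. The point it demonstrates is that whole time frames are hidden, rather than tokens scattered uniformly at random, so the model must reconstruct them from surrounding temporal context.

```python
# Minimal sketch (hypothetical names and shapes, not MDSGen's actual implementation):
# mask entire time frames of a [batch, time, freq] audio-token grid so that
# reconstruction has to rely on temporal context from the unmasked frames.
import torch

def temporal_aware_mask(batch: int, n_time: int, n_freq: int, mask_ratio: float = 0.7):
    """Return a boolean token mask of shape [batch, n_time * n_freq]; True = masked."""
    n_masked_frames = int(round(mask_ratio * n_time))
    mask = torch.zeros(batch, n_time, n_freq, dtype=torch.bool)
    for b in range(batch):
        # Pick which time frames to hide for this sample.
        frames = torch.randperm(n_time)[:n_masked_frames]
        mask[b, frames, :] = True  # mask every frequency token in those frames
    return mask.flatten(1)  # flatten the grid into a per-token mask

# Example: 16 time frames x 8 frequency tokens per frame, ~70% of frames hidden.
mask = temporal_aware_mask(batch=2, n_time=16, n_freq=8)
print(mask.shape, mask.float().mean().item())  # torch.Size([2, 128]) ~0.69
```

Masking contiguous frames rather than independent tokens is what makes the strategy temporal-aware in spirit; the exact masking schedule and how it interacts with the masked diffusion objective are detailed in the paper.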
Related papers
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model [60.171601995737646]
Mobile-VideoGPT is an efficient multimodal framework for video understanding.
It consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM).
Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization [20.441652320245975]
E-MD3C is a highly efficient framework for zero-shot object image customization.
Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers.
E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS.
arXiv Detail & Related papers (2025-02-13T10:48:11Z) - Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation [30.05431858162078]
Text-to-motion (T2M) generation plays a significant role in various applications.
Current methods involve a large number of parameters and suffer from slow inference speeds.
We propose a lightweight and fast model named Light-T2M to reduce usage costs.
arXiv Detail & Related papers (2024-12-15T13:58:37Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by $1.73\times$, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - Diffusion Model Patching via Mixture-of-Prompts [17.04227271007777]
Diffusion Model Patching (DMP) is a simple method to boost the performance of pre-trained diffusion models.
DMP inserts a small, learnable set of prompts into the model's input space while keeping the original model frozen.
arXiv Detail & Related papers (2024-05-28T04:47:54Z) - SparseDM: Toward Sparse Efficient Diffusion Models [20.783533300147866]
We propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models.
Experimental results on Transformer- and UNet-based diffusion models demonstrate that our method reduces MACs by 50% while maintaining FID.
arXiv Detail & Related papers (2024-04-16T10:31:06Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - SeTformer is What You Need for Vision and Language [26.036537788653373]
Self-optimal Transport (SeT) is a novel transformer for achieving better performance and computational efficiency.
SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K.
SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark.
arXiv Detail & Related papers (2024-01-07T16:52:49Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - Gradient-based Parameter Selection for Efficient Fine-Tuning [41.30092426231482]
Gradient-based Parameter Selection (GPS) is a new parameter-efficient fine-tuning method.
GPS does not introduce any additional parameters and computational costs during both the training and inference stages.
GPS achieves accuracy improvements of 3.33% (91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) while tuning only 0.36% of the pre-trained model's parameters on average over 24 image classification tasks.
arXiv Detail & Related papers (2023-12-15T18:59:05Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints.
MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model.
We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Efficiently Scaling Transformer Inference [8.196193683641582]
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings.
We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices.
We achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens.
arXiv Detail & Related papers (2022-11-09T18:50:38Z) - Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, meaning that one only needs to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning.
arXiv Detail & Related papers (2022-10-17T08:14:49Z) - An Efficient Deep Learning Model for Automatic Modulation Recognition Based on Parameter Estimation and Transformation [3.3941243094128035]
This letter proposes an efficient DL-AMR model based on phase parameter estimation and transformation.
Our model is more competitive in training time and test time than the benchmark models with similar recognition accuracy.
arXiv Detail & Related papers (2021-10-11T03:28:28Z) - Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)