Related papers: Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

URL: http://arxiv.org/abs/2412.12953v1
Date: Tue, 17 Dec 2024 14:34:51 GMT
Title: Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
Abstract summary: Mixture-of-Denoising Experts (MoDE) is a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks.
Score: 19.66373610185542
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
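Illustration (not the authors' code): the abstract attributes the inference savings to noise-conditioned routing combined with expert caching. A minimal sketch of that idea follows, assuming the router sees only the noise level and never the tokens, so the expert choice for each step of a fixed sampling schedule can be computed once and reused. The router weights, the sinusoidal embedding, the schedule values, and all function names here are hypothetical and chosen only for the example.

```python
import math
import random


def make_router(num_experts, embed_dim, seed=0):
    """Hypothetical router: a random linear map from noise embedding to expert logits."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(num_experts)]


def noise_embedding(sigma, embed_dim):
    """Sinusoidal embedding of log(sigma); an assumed choice, embed_dim must be even."""
    x = math.log(sigma)
    half = embed_dim // 2
    freqs = [math.exp(-math.log(1000.0) * i / half) for i in range(half)]
    return [math.sin(x * f) for f in freqs] + [math.cos(x * f) for f in freqs]


def route(router, sigma, top_k):
    """Return (expert_index, weight) pairs selected from the noise level alone."""
    emb = noise_embedding(sigma, len(router[0]))
    logits = [sum(w * e for w, e in zip(row, emb)) for row in router]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, v / total) for i, v in zip(chosen, exps)]


def build_cache(router, sigmas, top_k):
    """Precompute the routing decision for every noise level in a fixed schedule."""
    return {sigma: route(router, sigma, top_k) for sigma in sigmas}


router = make_router(num_experts=4, embed_dim=8)
schedule = [80.0, 20.0, 5.0, 1.0, 0.1]  # hypothetical noise levels
cache = build_cache(router, schedule, top_k=2)

for sigma in schedule:
    print(sigma, cache[sigma])
```

Because routing in this sketch is a pure function of the noise level, a lookup in `cache` replaces the router call at inference time, and only the selected experts need to be evaluated at each denoising step.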

Related papers

DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers [86.5541501589166]
DiffMoE is a batch-level global token pool that enables experts to access global token distributions during training. It achieves state-of-the-art performance among diffusion models on the ImageNet benchmark. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
arXiv Detail & Related papers (2025-03-18T17:57:07Z)
One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism [1.7170348600689374]
We propose a novel compression methodology that dynamically determines the rank of each layer using a soft thresholding mechanism. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33X to 1.72X in the encoder/decoder with a 50% reduction in total parameters.
arXiv Detail & Related papers (2024-11-15T19:29:51Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs. We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Current MoE models often display parameter inefficiency. We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters. We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks. We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
Dynamic Multi-scale Convolution for Dialect Identification [18.132769601922682]
We propose dynamic multi-scale convolution, which consists of dynamic kernel convolution, local multi-scale learning, and global multi-scale pooling. The proposed architecture significantly outperforms the state-of-the-art system on the AP20-OLR-dialect-task of oriental language recognition.
arXiv Detail & Related papers (2021-08-02T03:37:15Z)
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [29.582683923988203]
Mixture of Experts (MoE) based Transformers have shown promising results in many domains. In this work, we explore an MoE-based model for speech recognition, named SpeechMoE. SpeechMoE uses a new router architecture that can simultaneously utilize the information from a shared embedding network.
arXiv Detail & Related papers (2021-05-07T02:38:23Z)
End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)