MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
- URL: http://arxiv.org/abs/2512.20604v1
- Date: Tue, 23 Dec 2025 18:50:54 GMT
- Title: MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
- Authors: Alexandros Christoforos, Chadbourne Davis,
- Abstract summary: MoE-DiffuSeq is a mixture of experts based framework for enhancing diffusion models in long document generation.<n>MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling.
- Score: 45.88028371034407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MoE-DiffuSeq, a mixture of experts based framework for enhancing diffusion models in long document generation. Existing diffusion based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high quality long form text generation.
Related papers
- LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model [77.66516875262963]
We present textbfLLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation.<n>Building on MoD, we introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings.<n>Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
arXiv Detail & Related papers (2026-03-01T12:05:06Z) - Enhancing few-shot time series forecasting with LLM-guided diffusion [12.286204074670236]
Time series forecasting in specialized domains is often constrained by limited data availability.<n>We propose LTSM-DIFF, a novel learning framework that integrates the expressive power of large language models with the generative capability of diffusion models.<n>Our work establishes a new paradigm for time series analysis under data scarcity.
arXiv Detail & Related papers (2026-01-19T06:30:05Z) - SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention [45.88028371034407]
SA-DiffuSeq is a diffusion framework that integrates sparse attention to improve scalability for long document modeling.<n>Our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long text generation.
arXiv Detail & Related papers (2025-12-23T19:35:02Z) - OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot [4.990334603434127]
OBS-Diff is a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models.<n>Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
arXiv Detail & Related papers (2025-10-08T08:19:15Z) - SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation [62.14510717860079]
We propose a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion.<n>SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation.<n>Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation.
arXiv Detail & Related papers (2025-10-07T17:29:28Z) - Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns semantically rich latent space while maintaining reconstruction fidelity.<n>MaETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
arXiv Detail & Related papers (2025-02-05T18:42:04Z) - Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.<n>Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - Discrete Diffusion Language Model for Efficient Text Summarization [19.267738861590487]
We introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively.<n>Our approaches achieve state-of-the-art performance on three benchmark summarization datasets: Gigaword, CNN/DailyMail, and Arxiv.
arXiv Detail & Related papers (2024-06-25T09:55:22Z) - Memory-Efficient Fine-Tuning for Quantized Diffusion Model [12.875837358532422]
We introduce TuneQDM, a memory-efficient fine-tuning method for quantized diffusion models.
Our method consistently outperforms the baseline in both single-/multi-subject generations.
arXiv Detail & Related papers (2024-01-09T03:42:08Z) - The Missing U for Efficient Diffusion Models [3.712196074875643]
Diffusion Probabilistic Models yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design.
Despite their capabilities, their efficiency, especially in the reverse process, remains a challenge due to slow convergence rates and high computational costs.
We introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models.
arXiv Detail & Related papers (2023-10-31T00:12:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.