Scaling up Masked Diffusion Models on Text
- URL: http://arxiv.org/abs/2410.18514v1
- Date: Thu, 24 Oct 2024 08:01:22 GMT
- Title: Scaling up Masked Diffusion Models on Text
- Authors: Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li
- Abstract summary: Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored.
This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap.
- Score: 43.16800764711572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In text generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.4 times faster, or achieve higher quality than ARMs at a higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reverse curse encountered by much larger ARMs with significantly more data and computation, such as Llama-2 (13B) and GPT-3 (175B). Our code is available at https://github.com/ML-GSAI/SMDM.
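For readers skimming the abstract, the proposed unsupervised classifier-free guidance can be pictured as the standard CFG combination of a conditional and an unconditional mask-prediction pass, where the unconditional pass simply masks out the prompt so that large-scale unpaired data suffices. The sketch below is a minimal illustration under that reading; the function name, the guidance scale w, and the exact way the unconditional pass is formed are assumptions for illustration, not the paper's actual implementation (see the linked SMDM repository for that).

```python
import torch

def guided_mask_logits(model, prompt_ids, x_t, mask_id, w=1.0):
    """Classifier-free guidance for one masked-diffusion denoising step (sketch).

    model      -- any callable mapping token ids -> per-position vocab logits
    prompt_ids -- conditioning tokens (the prompt part of the sequence)
    x_t        -- partially masked response tokens at the current step
    mask_id    -- id of the [MASK] token
    w          -- guidance scale; w = 0 recovers plain conditional prediction
    """
    # Conditional pass: real prompt + partially masked response.
    cond_logits = model(torch.cat([prompt_ids, x_t], dim=-1))

    # "Unsupervised" unconditional pass (assumed here): the prompt itself is
    # fully masked, so the same network doubles as the unconditional model
    # and can be trained on unpaired data.
    masked_prompt = torch.full_like(prompt_ids, mask_id)
    uncond_logits = model(torch.cat([masked_prompt, x_t], dim=-1))

    # Standard CFG combination in logit space.
    return (1.0 + w) * cond_logits - w * uncond_logits
```

Setting w = 0 reduces this to ordinary conditional mask prediction; larger w leans harder on the conditioning signal.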
Related papers
- Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution.
We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution.
We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z)
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models [36.0400717590138]
We present OmniMamba, the first linear-architecture-based multimodal generation model.
It generates both text and images through a unified next-token prediction paradigm.
It achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks.
arXiv Detail & Related papers (2025-03-11T17:59:46Z)
- Large Language Diffusion Models [77.02553707673418]
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs).
We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm.
Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z)
- Enabling Autoregressive Models to Fill In Masked Tokens [50.9948753314669]
This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that achieves state-of-the-art masked infilling performance.
MARIA combines a pre-trained masked model and an autoregressive model by training a linear decoder that takes their hidden states as input (a minimal sketch of this combination appears after this list).
Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
arXiv Detail & Related papers (2025-02-09T20:02:05Z)
- ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction [55.03585818289934]
Autoregressive models (ARMs) and diffusion models (DMs) represent two leading paradigms in generative modeling.
We introduce Autoregressive Coherent multimodal generation with Diffusion Correction (ACDC).
ACDC combines the strengths of both ARMs and DMs at the inference stage without the need for additional fine-tuning.
arXiv Detail & Related papers (2024-10-07T03:22:51Z)
- Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling [47.82616476928464]
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data.
We show that both training and sampling of MDMs are theoretically free from the time variable.
We identify, for the first time, an underlying numerical issue in the categorical sampling of MDMs, even with the commonly used 32-bit floating-point precision.
arXiv Detail & Related papers (2024-09-04T17:48:19Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm, C-AdvUL, composed of two losses.
We also introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which would require training a separate denoising model, our method enhances the robustness of LLMs with significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- ROIC-DM: Robust Text Inference and Classification via Diffusion Model [40.47452511263549]
This paper introduces ROIC-DM, an innovative model for robust text inference and classification built upon diffusion models.
Benefiting from its training involving denoising stages, ROIC-DM inherently exhibits greater robustness compared to conventional language models.
Extensive experiments conducted with several strong textual adversarial attacks on three datasets demonstrate that ROIC-DM outperforms traditional language models in robustness.
arXiv Detail & Related papers (2024-01-07T15:05:26Z)
- DFedADMM: Dual Constraints Controlled Model Inconsistency for Decentralized Federated Learning [52.83811558753284]
Decentralized federated learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z)
- RobustPdM: Designing Robust Predictive Maintenance against Adversarial Attacks [0.0]
We show that adversarial attacks can cause a severe defect (up to 11X) in Remaining Useful Life (RUL) prediction, exceeding the effectiveness of state-of-the-art PdM attacks by 3X.
We also present a novel approximate adversarial training method to defend against adversarial attacks.
arXiv Detail & Related papers (2023-01-25T20:49:12Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
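Picking up the forward reference in the MARIA entry above: the described combination, a trained linear decoder over the hidden states of a pre-trained masked model and an AR model, might look roughly like the sketch below. The concatenation of hidden states, the frozen backbones, the module names, and the dimensions are illustrative assumptions, not MARIA's exact architecture.

```python
import torch
import torch.nn as nn

class MariaStyleDecoder(nn.Module):
    """Linear decoder over the hidden states of a masked model and an AR model
    (illustrative sketch; the real MARIA architecture may differ)."""

    def __init__(self, masked_dim, ar_dim, vocab_size):
        super().__init__()
        # Only this linear layer is trained in this sketch; both backbone
        # models are assumed frozen and only provide hidden states.
        self.decoder = nn.Linear(masked_dim + ar_dim, vocab_size)

    def forward(self, masked_hidden, ar_hidden):
        # masked_hidden: (batch, seq, masked_dim) from the pre-trained masked model
        # ar_hidden:     (batch, seq, ar_dim)     from the pre-trained AR model
        fused = torch.cat([masked_hidden, ar_hidden], dim=-1)
        return self.decoder(fused)  # (batch, seq, vocab_size) infilling logits

# Hypothetical usage: train with cross-entropy on the infill (masked) positions only.
decoder = MariaStyleDecoder(masked_dim=768, ar_dim=768, vocab_size=50257)
h_masked = torch.randn(2, 16, 768)   # stand-in for masked-model hidden states
h_ar = torch.randn(2, 16, 768)       # stand-in for AR-model hidden states
logits = decoder(h_masked, h_ar)     # -> torch.Size([2, 16, 50257])
```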