LLaDA2.0: Scaling Up Diffusion Language Models to 100B
- URL: http://arxiv.org/abs/2512.15745v1
- Date: Wed, 10 Dec 2025 09:26:18 GMT
- Title: LLaDA2.0: Scaling Up Diffusion Language Models to 100B
- Authors: Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang
- Abstract summary: We present LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters. LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design. We obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment.
- Score: 96.84156938318931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size in block diffusion (decay). Along with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
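As a rough illustration of the three-phase block-level WSD scheme described in the abstract, the sketch below maps a training step to the block size used for block diffusion: the block grows during warm-up, covers the full sequence during the stable phase, and shrinks back to a compact size during decay. The function name, step counts, and block sizes are hypothetical placeholders chosen for illustration, not values taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of a 3-phase block-level
# WSD-style block-size schedule: warm-up -> stable (full sequence) -> decay.
# All step counts and block sizes below are hypothetical placeholders.

def block_size_schedule(step: int,
                        warmup_steps: int = 10_000,
                        stable_steps: int = 80_000,
                        min_block: int = 32,
                        full_seq_block: int = 4_096,  # "full sequence" for this toy setup
                        compact_block: int = 128) -> int:
    """Return the block-diffusion block size to use at a given training step."""
    if step < warmup_steps:
        # Warm-up: progressively increase block size from min_block toward full_seq_block.
        frac = step / max(warmup_steps, 1)
        return int(min_block + frac * (full_seq_block - min_block))
    if step < warmup_steps + stable_steps:
        # Stable: large-scale full-sequence diffusion (block spans the whole sequence).
        return full_seq_block
    # Decay: revert to a compact block size for efficient blockwise parallel decoding.
    return compact_block


if __name__ == "__main__":
    for s in (0, 5_000, 9_999, 10_000, 50_000, 95_000):
        print(f"step {s:>6}: block size {block_size_schedule(s)}")
```

The linear warm-up ramp here is only for concreteness; the paper describes warm-up as a progressive increase in block size, and this sketch does not reproduce its actual schedule or hyperparameters.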
Related papers
- LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model [77.66516875262963]
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. Building on MoD, we introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings. Experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks.
arXiv Detail & Related papers (2026-03-01T12:05:06Z) - DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models [43.99949601044522]
The performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. We propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog) benchmark, alongside a 2x inference speedup.
arXiv Detail & Related papers (2025-12-17T18:59:55Z) - SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation [62.14510717860079]
We propose a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation.
arXiv Detail & Related papers (2025-10-07T17:29:28Z) - Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z) - LLaDA-MoE: A Sparse MoE Diffusion Language Model [88.96960440635992]
We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths.
arXiv Detail & Related papers (2025-09-29T07:38:59Z) - David helps Goliath: Inference-Time Collaboration Between Small
Specialized and Large General Diffusion LMs [49.822063966687175]
Diffusion-based language models are emerging as a promising alternative to autoregressive LMs.
We propose methods to scale a recently proposed diffusion model SSD-LM from 0.4B to 13B parameters.
We show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users.
arXiv Detail & Related papers (2023-05-24T06:22:14Z)