LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
- URL: http://arxiv.org/abs/2505.19223v1
- Date: Sun, 25 May 2025 16:36:20 GMT
- Title: LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
- Authors: Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li
- Abstract summary: Masked Diffusion Models (MDMs) present a promising paradigm for language modeling. The challenge arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. We propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients.
- Score: 76.8317443926908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
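The two variance-reduction strategies named in the abstract can be made concrete with a small sketch. The PyTorch code below is one plausible reading, not the released implementation: the Monte Carlo budget is spent on many distinct timesteps with a single mask draw each (the budget-allocation idea), and the same (timestep, mask) samples are reused for the policy and reference ELBO estimates so that sampling noise cancels in the DPO log-ratio (one common form of the antithetic idea). `estimate_elbo`, `mask_id`, the 1/t ELBO weighting, and the DPO-style loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def estimate_elbo(model, tokens, mask_id, timesteps, masks):
    """Monte Carlo ELBO estimate for a masked diffusion model, spending the
    whole sampling budget on distinct timesteps (one mask draw per timestep)."""
    terms = []
    for t, mask in zip(timesteps, masks):
        # Replace the masked positions with the mask token and score the rest.
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        logits = model(noisy)                         # (seq, vocab), assumed
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        terms.append((token_logp * mask).sum() / t)   # 1/t-weighted masked term
    return torch.stack(terms).mean()

def vrpo_dpo_loss(model, ref_model, chosen, rejected, mask_id, beta=0.1, n=8):
    margins = []
    for tokens in (chosen, rejected):
        # Draw the (timestep, mask) samples ONCE and reuse them for both the
        # policy and the reference ELBO: the shared sampling noise cancels in
        # the log-ratio, reducing the variance of the preference gradient.
        timesteps = torch.rand(n).clamp_min(1e-3)
        masks = [torch.rand(tokens.shape) < t for t in timesteps]
        elbo = estimate_elbo(model, tokens, mask_id, timesteps, masks)
        with torch.no_grad():
            ref_elbo = estimate_elbo(ref_model, tokens, mask_id, timesteps, masks)
        margins.append(elbo - ref_elbo)
    return -F.logsigmoid(beta * (margins[0] - margins[1]))
```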
Related papers
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [58.651951388346525]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. Our results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques. DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
- Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [38.99428012275441]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size. We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
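For context, a generic SPSA-style zero-order step is sketched below; this illustrates the classic idea the summary refers to, not the paper's exact SignSGD/Muon variant, and `loss_fn` and `params` are placeholders.

```python
import torch

def zo_step(params, loss_fn, lr=1e-4, eps=1e-3):
    """One SPSA-style zero-order step: two forward passes, no backward pass,
    so no gradients or activations need to be kept in memory."""
    zs = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, z in zip(params, zs):
            p.add_(eps * z)                # evaluate loss at theta + eps*z
        loss_plus = loss_fn()
        for p, z in zip(params, zs):
            p.sub_(2 * eps * z)            # evaluate loss at theta - eps*z
        loss_minus = loss_fn()
        for p, z in zip(params, zs):
            p.add_(eps * z)                # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)  # directional derivative
        for p, z in zip(params, zs):
            p.sub_(lr * g * z)             # descend along the sampled direction
```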
arXiv Detail & Related papers (2025-06-04T20:27:17Z)
- PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning [5.052293146674794]
Averaging techniques such as Ruppert--Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute several averaged variants of ADAM in parallel and, during the training process, dynamically select the variant with the smallest optimization error.
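A minimal sketch of that parallel-averaging idea follows: run ADAM as usual, keep several EMA copies of the weights with different decay rates, and finally keep whichever copy has the lowest validation error. The decay grid and selection rule here are assumptions, not the paper's exact procedure.

```python
import torch

def train_with_parallel_ema(model, opt, loss_fn, val_loss_fn, steps,
                            decays=(0.9, 0.99, 0.999)):
    params = list(model.parameters())
    # One EMA copy of the weights per decay rate, updated alongside ADAM.
    averages = [[p.detach().clone() for p in params] for _ in decays]
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model).backward()
        opt.step()
        with torch.no_grad():
            for avg, d in zip(averages, decays):
                for a, p in zip(avg, params):
                    a.mul_(d).add_((1 - d) * p)   # EMA update of the copy
    # Select the averaged variant with the smallest validation error
    # (val_loss_fn is a placeholder scoring a list of weight tensors).
    best = min(averages, key=lambda avg: val_loss_fn(avg))
    with torch.no_grad():
        for p, a in zip(params, best):
            p.copy_(a)
```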
arXiv Detail & Related papers (2025-05-28T08:07:34Z)
- AdUE: Improving uncertainty estimation head for LoRA adapters in LLMs [1.83270805462863]
We introduce AdUE1, an efficient post-hoc uncertainty estimation (UE) method to enhance softmax-based estimates. Our approach is lightweight (no base-model changes) and produces better-calibrated confidence.
arXiv Detail & Related papers (2025-05-21T12:23:40Z)
- InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment [12.823734370183482]
We introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively. Experimental results indicate that DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning.
arXiv Detail & Related papers (2025-03-24T08:58:49Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
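The reward-guided acceptance rule can be sketched in a few lines: accept a cheap draft model's step when a reward model scores it highly, otherwise regenerate with the large target model. The function names and threshold rule below are assumptions for illustration, not RSD's exact algorithm.

```python
def rsd_generate(draft_step, target_step, reward, prompt, max_steps=64, tau=0.7):
    """draft_step / target_step: fn(text) -> next chunk of text.
    reward: fn(context, chunk) -> score in [0, 1], e.g. a process reward model."""
    text = prompt
    for _ in range(max_steps):
        chunk = draft_step(text)           # cheap draft proposal
        if reward(text, chunk) < tau:      # low-reward draft: regenerate the
            chunk = target_step(text)      # step with the expensive target model
        text += chunk
        if chunk.endswith("<eos>"):
            break
    return text
```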
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
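One way to picture the multi-reference idea: the single reference model in standard DPO is replaced by an aggregate over several reference models. The simple average below is an assumption for illustration; MRPO derives its own closed-form aggregation.

```python
import torch
import torch.nn.functional as F

def multi_ref_dpo_loss(logp_w, logp_l, ref_logps_w, ref_logps_l, beta=0.1):
    """logp_w / logp_l: policy log-likelihoods of the chosen / rejected
    responses; ref_logps_*: lists of log-likelihoods, one per reference model."""
    ref_w = torch.stack(ref_logps_w).mean()   # aggregate the reference models
    ref_l = torch.stack(ref_logps_l).mean()   # (simple average as a stand-in)
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -F.logsigmoid(beta * margin)
```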
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data.
We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO.
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)
- Efficient Estimation in NPIV Models: A Comparison of Various Neural Networks-Based Estimators [1.4000007799304268]
We investigate the computational performance of Artificial Neural Networks (ANNs) in semi-nonparametric instrumental variables (NPIV) models.
We focus on efficient estimation of expectations and use ANNs to approximate the unknown function. We conduct a large set of Monte Carlo experiments that compare the finite-sample performance in complicated designs.
arXiv Detail & Related papers (2021-10-13T15:00:33Z)