Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
- URL: http://arxiv.org/abs/2502.00639v3
- Date: Sun, 28 Sep 2025 09:20:24 GMT
- Title: Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
- Authors: Tao Ren, Zishi Zhang, Jingyang Jiang, Zehao Li, Shentao Qin, Yi Zheng, Guanghao Li, Qianyou Sun, Yan Li, Jiafeng Liang, Xinping Li, Yijie Peng
- Abstract summary: The probabilistic diffusion model (DM) generates content by inference through a chain structure. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). We propose the Recursive Likelihood Ratio (RLR) fine-tuning paradigm for DM.
- Score: 16.103949557802988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The probabilistic diffusion model (DM), which generates content by inference through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are based either on Reinforcement Learning (RL) or on truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, even worse, complete training failure. To overcome these challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DM. The HO gradient estimator enables rearranging the computation graph within the recursive diffusive chain, making the RLR's gradient estimator unbiased with lower variance than competing methods. We theoretically investigate the bias, variance, and convergence of our method. Extensive experiments on image and video generation validate the superiority of the RLR. Furthermore, we propose a novel prompting technique that naturally complements the RLR for a synergistic effect.
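To make the half-order idea concrete, here is a minimal, hypothetical sketch of a likelihood-ratio (score-function) gradient taken over a recursive sampling chain, the generic mechanism the abstract builds on. It is not the authors' RLR estimator; the toy drift, reward, chain length, and noise scale are all illustrative assumptions.

```python
import torch

# Toy 1-D sampling chain with a learnable drift. Illustrates the generic
# likelihood-ratio gradient over a recursive chain, NOT the paper's RLR.

torch.manual_seed(0)
theta = torch.zeros(1, requires_grad=True)    # toy denoiser parameter
T, sigma = 10, 0.5                            # chain length, transition noise

def reward(x):                                # stand-in for an alignment reward
    return -(x - 1.0) ** 2

x = torch.randn(1)                            # x_T ~ N(0, I)
logp = torch.zeros(1)
for t in range(T):
    mean = x + theta                          # toy transition mean mu_theta(x_t, t)
    x = (mean + sigma * torch.randn(1)).detach()         # sample; no pathwise grad
    logp = logp + (-(x - mean) ** 2) / (2 * sigma ** 2)  # log-density up to a constant

# Likelihood-ratio identity: grad E[R] = E[ R * grad log p_theta(trajectory) ]
loss = -(reward(x).detach() * logp).sum()
loss.backward()
print(theta.grad)                             # single-sample unbiased LR gradient
```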
Related papers
- RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models [14.093802378976315]
Diffusion-based remote sensing (RS) generative foundation models rely on large amounts of globally representative data. We propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality.
arXiv Detail & Related papers (2025-12-29T06:44:06Z)
- Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency, measured in tokens/sec, across multiple math reasoning benchmarks and model sizes.
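A minimal sketch of the generic mechanism at play, importance-weighted off-policy REINFORCE, with random stand-in tensors; the paper's actual behavior model (a layer subset of the target) and its variance controls are not reproduced here.

```python
import torch

# Generic importance-weighted off-policy policy gradient. The behavior
# log-probs are faked with a shifted copy of the target log-probs.

torch.manual_seed(0)
logits = torch.randn(4, 8, requires_grad=True)        # target policy logits
actions = torch.randint(0, 8, (4,))                   # completions from behavior policy
logp_t = torch.log_softmax(logits, dim=-1)[torch.arange(4), actions]
logp_b = logp_t.detach() - 0.1                        # placeholder behavior log-probs
rewards = torch.randn(4)

w = (logp_t.detach() - logp_b).exp()                  # importance ratios pi_theta / pi_b
loss = -(w * rewards * logp_t).mean()                 # IS-corrected policy gradient
loss.backward()
```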
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
- Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function that combines the current model's predictions and the given labels.
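As a toy instance of such an aggregator, assume a symmetric label-flip noise model with a known flip rate `eps` (an assumption for illustration, not the paper's AMP-derived, problem-specific rule):

```python
def bayes_aggregate(p, y, eps):
    """Posterior P(true = 1 | model prob p, observed label y) under a
    symmetric label-flip model with flip probability eps -- a toy stand-in
    for the paper's AMP-derived optimal aggregator."""
    like1 = 1 - eps if y == 1 else eps        # P(y | true = 1)
    like0 = eps if y == 1 else 1 - eps        # P(y | true = 0)
    return p * like1 / (p * like1 + (1 - p) * like0)

print(bayes_aggregate(0.7, 1, 0.2))           # ~0.90: label and model agree
```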
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
- Flow-GRPO: Training Flow Matching Models via Online RL [80.62659379624867]
We propose Flow-GRPO, the first method to integrate online policy reinforcement learning into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) matching the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps.
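The marginal-preserving ODE-to-SDE conversion is standard in score-based modeling; a generic form (Flow-GRPO's exact parameterization may differ) is:

```latex
% Both processes share the same marginals p_t:
\begin{align*}
  \text{ODE:} \quad & \mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t, \\
  \text{SDE:} \quad & \mathrm{d}x_t = \Big( v_\theta(x_t, t)
      + \tfrac{\sigma_t^2}{2}\, \nabla_x \log p_t(x_t) \Big)\,\mathrm{d}t
      + \sigma_t\, \mathrm{d}W_t .
\end{align*}
```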
arXiv Detail & Related papers (2025-05-08T17:58:45Z)
- Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL [20.177871969184004]
Chain-of-thought (CoT) reasoning can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. Prior approaches such as iterative reward-ranked fine-tuning fail to account for variability in difficulty and convergence behavior. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy that minimizes gradient variance under a computational budget constraint.
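For intuition: minimizing total gradient variance of the form sum_i sigma_i^2 / n_i under a fixed sample budget yields a Neyman-style allocation n_i proportional to sigma_i. The sketch below illustrates this generic idea only; the paper's prompt-specific rule is derived from its own variance model.

```python
import numpy as np

# Neyman-style allocation: n_i proportional to sigma_i under a fixed budget.
def allocate(sigmas, budget):
    sigmas = np.asarray(sigmas, dtype=float)
    n = budget * sigmas / sigmas.sum()        # optimal fractional allocation
    return np.maximum(1, np.round(n)).astype(int)

print(allocate([0.1, 0.5, 2.0], budget=26))   # noisier prompts get more samples
```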
arXiv Detail & Related papers (2025-05-05T06:26:00Z)
- Likelihood-Ratio Regularized Quantile Regression: Adapting Conformal Prediction to High-Dimensional Covariate Shifts [35.16750653336608]
We introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization.
We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term.
Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks.
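For reference, the pinball (quantile) loss at level tau, the base objective LR-QR regularizes, is sketched below; the likelihood-ratio regularizer itself is paper-specific and omitted.

```python
import numpy as np

# Pinball (quantile) loss at level tau.
def pinball(y, q, tau):
    diff = y - q
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

print(pinball(np.array([1.0, 2.0, 3.0]), q=2.0, tau=0.9))  # 0.333...
```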
arXiv Detail & Related papers (2025-02-18T16:46:44Z)
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [15.038210624870656]
Reward inference is a critical intermediate step in the Reinforcement Learning from Human Feedback pipeline.
This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model.
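The zeroth-order idea can be illustrated with a standard two-point gradient estimator that uses only function evaluations, so no reward model or backpropagation is needed; the paper's algorithms adapt this to preference feedback rather than the scalar objective assumed here.

```python
import numpy as np

# Two-point zeroth-order gradient estimator from function values only.
def zo_grad(f, theta, mu=1e-2):
    u = np.random.randn(*theta.shape)         # random probe direction
    return (f(theta + mu * u) - f(theta - mu * u)) / (2 * mu) * u

theta = np.array([1.0, -2.0])
g = zo_grad(lambda t: -np.sum(t ** 2), theta) # estimates the direction -2*theta
print(g)
```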
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
- Reward-Directed Score-Based Diffusion Models via q-Learning [8.725446812770791]
We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI. Our formulation does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions. We show the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods.
arXiv Detail & Related papers (2024-09-07T13:55:45Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with performance comparable to or stronger than PPO and DPO.
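Up to notation, REBEL's central step is a least-squares regression of the reward gap between two completions onto the gap in policy log-ratios:

```latex
\[
  \min_\theta \; \mathbb{E}_{(x,\, y,\, y')}
  \left( \frac{1}{\eta} \left[ \ln \frac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)}
  - \ln \frac{\pi_\theta(y' \mid x)}{\pi_t(y' \mid x)} \right]
  - \big[ r(x, y) - r(x, y') \big] \right)^{\!2}
\]
```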
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees [63.18324983384337]
We introduce the first stochastic learning to rank method for Gradient Boosted Decision Trees (GBDTs).
Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix.
We incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only.
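The Hessian matters because second-order (Newton-style) boosting uses per-example gradients g_i and Hessians h_i of the objective to set leaf weights and split gains, as in the standard formulas below (regularizer lambda):

```latex
\[
  w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
  \qquad
  \text{gain}(I_j) = \frac{1}{2} \cdot
  \frac{\big( \sum_{i \in I_j} g_i \big)^{2}}{\sum_{i \in I_j} h_i + \lambda}.
\]
```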
arXiv Detail & Related papers (2024-04-18T13:53:32Z)
- CoRMF: Criticality-Ordered Recurrent Mean Field Ising Solver [4.364088891019632]
We propose an RNN-based efficient Ising model solver, the Criticality-ordered Recurrent Mean Field (CoRMF).
By leveraging the approximated tree structure of the underlying Ising graph, the newly-obtained criticality order enables the unification between variational mean-field and RNN.
CoRMF solves Ising problems in a self-training fashion without data/evidence, and inference tasks can be executed by directly sampling from the RNN.
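The data-free ("self-training") signal such autoregressive solvers minimize is the variational free energy of their own samples, a standard bound; CoRMF's criticality ordering changes the factorization order, not this objective:

```latex
% Samples s ~ q_theta drawn from the RNN minimize the variational free
% energy, which upper-bounds the true free energy at inverse temperature beta.
\[
  F(\theta) = \mathbb{E}_{s \sim q_\theta}\!\left[ E(s)
  + \tfrac{1}{\beta} \ln q_\theta(s) \right] \;\ge\; F_{\text{true}} .
\]
```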
arXiv Detail & Related papers (2024-03-05T16:55:06Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
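A minimal sketch of spectral normalization, which rescales a weight matrix by an estimate of its top singular value (one power-iteration step per update), bounding each layer's Lipschitz constant and hence the blow-up of long model unrolls; details here are generic, not the paper's exact recipe.

```python
import numpy as np

# One power-iteration step estimates the top singular value sigma of W;
# dividing by sigma bounds the layer's spectral norm at ~1.
def spectral_normalize(W, u):
    v = W.T @ u
    v /= np.linalg.norm(v) + 1e-12
    u = W @ v
    u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v                         # estimated top singular value
    return W / sigma, u                       # persist u across training steps

W = np.random.randn(64, 32)
W_sn, u = spectral_normalize(W, u=np.random.randn(64))
```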
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with Distributionally Robust Optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
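Schematically, DRO with a parametric likelihood-ratio adversary r_phi solves a minimax problem of the following form; the constraints shown are a common choice, not necessarily the paper's exact formulation:

```latex
\[
  \min_\theta \; \max_\phi \;
  \mathbb{E}_{x \sim p} \big[ r_\phi(x)\, \ell(x; \theta) \big]
  \quad \text{s.t.} \quad
  r_\phi \ge 0, \quad \mathbb{E}_{x \sim p}[r_\phi(x)] = 1, \quad
  D\!\left( r_\phi \, p \,\|\, p \right) \le \rho .
\]
```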
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
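The classic extrapolation (extra-gradient) step these variants generalize first looks ahead along the gradient, then updates from the lookahead point:

```latex
\begin{align*}
  \tilde{x}_t &= x_t - \gamma\, \nabla f(x_t), \\
  x_{t+1}     &= x_t - \gamma\, \nabla f(\tilde{x}_t).
\end{align*}
```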
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.