RoDiF: Robust Direct Fine-Tuning of Diffusion Policies with Corrupted Human Feedback
- URL: http://arxiv.org/abs/2602.00886v1
- Date: Sat, 31 Jan 2026 20:17:15 GMT
- Title: RoDiF: Robust Direct Fine-Tuning of Diffusion Policies with Corrupted Human Feedback
- Authors: Amitesh Vatsa, Zhixian Xie, Wanxin Jin
- Abstract summary: We introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with environmental dynamics. We propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly addresses corrupted human preferences.
- Score: 4.908765539565052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion policies are a powerful paradigm for robotic control, but fine-tuning them with human preferences is fundamentally challenged by the multi-step structure of the denoising process. To overcome this, we introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with environmental dynamics, enabling reward-free Direct Preference Optimization (DPO) for diffusion policies. Building on this formulation, we propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly addresses corrupted human preferences. RoDiF reinterprets the DPO objective through a geometric hypothesis-cutting perspective and employs a conservative cutting strategy to achieve robustness without assuming any specific noise distribution. Extensive experiments on long-horizon manipulation tasks show that RoDiF consistently outperforms state-of-the-art baselines, effectively steering pretrained diffusion policies of diverse architectures to human-preferred modes, while maintaining strong performance even under 30% corrupted preference labels.
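The abstract leaves the geometric hypothesis-cutting machinery unspecified, so the following is only a minimal sketch of the reward-free DPO objective it builds on, with a simple loss-trimming step standing in for RoDiF's conservative cutting strategy. The trajectory log-likelihood inputs and the trimming heuristic are assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def robust_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, trim_frac=0.3):
    """Reward-free DPO over trajectory log-likelihoods (summed across
    denoising and environment steps, per the unified MDP view), with a
    conservative trimming step standing in for RoDiF's cutting rule.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected
    trajectories; ref_* are the frozen reference policy's. Shape: (batch,).
    """
    # Implicit Bradley-Terry reward margin, as in standard DPO.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(margin)
    # Assumed conservative step: drop the highest-loss pairs, the ones
    # most consistent with a flipped (corrupted) preference label.
    k = max(1, int(per_pair.numel() * (1.0 - trim_frac)))
    kept_neg, _ = torch.topk(-per_pair, k)  # keeps the k smallest losses
    return (-kept_neg).mean()
```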
Related papers
- Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation [8.112649652437705]
We argue that the solution lies in rigorously identifying the latent high-quality distribution entangled within a noisy behavior policy. We prove that hard filtering is the exact solution to this DRO objective, enabling DRPO to optimally recover high-quality behaviors while strictly discarding divergence-inducing noise.
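The summary claims hard filtering exactly solves the DRO objective; a minimal sketch of what hard filtering amounts to in practice is below, assuming a per-sample quality score (e.g. a critic estimate) is available. The `quantile` threshold is illustrative.

```python
import torch

def hard_filter(actions, scores, quantile=0.8):
    """Strictly discard low-quality behaviors before off-policy training.

    Treats the logged data as a mixture of high-quality and noisy actions
    and keeps only samples whose quality score clears a threshold.
    `scores` is an assumed per-sample quality estimate.
    """
    threshold = torch.quantile(scores, quantile)
    mask = scores >= threshold
    return actions[mask], scores[mask]
```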
arXiv Detail & Related papers (2026-02-11T02:18:27Z)
- AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models. Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z)
- On the Plasticity and Stability for Post-Training Large Language Models [54.757672540381236]
We identify a root cause as the conflict between plasticity and stability gradients. We propose Probabilistic Conflict Resolution (PCR), a framework that models gradients as random variables. PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.
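PCR's probabilistic treatment of gradients isn't detailed in the summary; as a rough stand-in, the sketch below shows the deterministic core of gradient conflict resolution (a PCGrad-style projection when the plasticity and stability gradients point in opposing directions). The projection rule is our assumption, not PCR itself.

```python
import torch

def resolve_conflict(g_plastic, g_stable, eps=1e-12):
    """Combine the plasticity gradient (new-task loss) and the stability
    gradient (retention loss), both flattened to 1-D tensors.

    When they conflict (negative inner product), project the plasticity
    gradient onto the normal plane of the stability gradient so the
    update does not undo retention. This is a deterministic stand-in
    for PCR's modeling of gradients as random variables.
    """
    dot = torch.dot(g_plastic, g_stable)
    if dot < 0:  # conflicting directions
        g_plastic = g_plastic - (dot / (g_stable.norm() ** 2 + eps)) * g_stable
    return g_plastic + g_stable
```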
arXiv Detail & Related papers (2026-02-06T07:31:26Z)
- The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving [57.652356955571065]
State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops. We analyze how this design choice is sensitive to the collapse of the model's distribution over reasoning paths. We introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces.
arXiv Detail & Related papers (2026-01-02T17:10:31Z)
- Dichotomous Diffusion Policy Optimization [46.51375996317989]
DIPOLE is a novel RL algorithm designed for stable and controllable diffusion policy optimization. We also use DIPOLE to train a large vision-language-action model for end-to-end autonomous driving.
arXiv Detail & Related papers (2025-12-31T16:56:56Z)
- Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising [22.356276412952738]
Diffusion models have achieved state-of-the-art results in robotic manipulation by imitating expert demonstrations. We show that by tailoring the denoising process to the specific characteristics of embodied AI tasks, diffusion policies can operate effectively with as few as two denoising steps. We propose a population-based sampling strategy, genetic denoising, which enhances both performance and stability.
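A minimal sketch of what a population-based ("genetic") denoising loop could look like, assuming a `policy.denoise_step` interface and a task-specific `fitness` score; both are assumptions, as are the population and mutation hyperparameters.

```python
import torch

@torch.no_grad()
def genetic_denoise(policy, obs, fitness, action_dim,
                    pop_size=16, n_elite=4, steps=2, jitter=0.05):
    """Sample a population of action candidates through a short denoising
    chain, select elites by fitness, and mutate them between steps."""
    actions = torch.randn(pop_size, action_dim)  # population of noisy actions
    for t in reversed(range(steps)):
        actions = policy.denoise_step(actions, obs, t)  # assumed interface
        scores = fitness(obs, actions)                  # shape (pop_size,)
        elite = actions[scores.topk(n_elite).indices]   # selection
        if t > 0:
            # Refill the population by jittering the elites (mutation).
            idx = torch.randint(n_elite, (pop_size,))
            actions = elite[idx] + jitter * torch.randn(pop_size, action_dim)
    return elite[0]  # highest-fitness candidate after the final step
```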
arXiv Detail & Related papers (2025-10-24T19:52:41Z)
- Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation [10.35045003737115]
Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We propose DR-RPO, a model-free online policy optimization method that learns robust policies with sublinear regret. We show that DR-RPO can achieve suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches.
arXiv Detail & Related papers (2025-10-16T02:56:58Z)
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
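The summary only states that advantages computed at multiple diffusion scales are aggregated; a minimal sketch under that reading follows, with GRPO-style group normalization at each scale and a uniform weighting across scales as assumptions.

```python
import torch

def multi_granularity_advantage(rewards_per_scale, weights=None):
    """Aggregate GRPO-style advantages across diffusion scales.

    rewards_per_scale: list of (group_size,) reward tensors, one per
    scale; the shapes and the uniform weighting are assumptions.
    """
    advs = torch.stack([
        (r - r.mean()) / (r.std() + 1e-8)  # group-normalized advantage
        for r in rewards_per_scale
    ])                                     # (n_scales, group_size)
    if weights is None:
        weights = torch.full((advs.shape[0],), 1.0 / advs.shape[0])
    return (weights[:, None] * advs).sum(dim=0)  # (group_size,)
```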
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application [11.385703484113552]
We propose a novel semantic communication framework empowered by generative artificial intelligence (GAI). A latent diffusion model (LDM)-based semantic communication framework is proposed that pairs a variational autoencoder for semantic feature extraction with LDM-based denoising. The proposed system is a training-free framework that supports zero-shot generalization and achieves superior performance under low-SNR and out-of-distribution conditions.
arXiv Detail & Related papers (2025-06-06T03:20:32Z)
- Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps [13.28742762414913]
We introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to Proximal Policy Optimization (PPO) when training from scratch.
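A sketch of the noise-conditioned reformulation as we read it: pre-sample every denoising step's Gaussian noise, then treat the chain as a deterministic, differentiable function of the observation and the noise sequence. `policy.denoise_step` and its `noise` argument are assumed interfaces.

```python
import torch

def noise_conditioned_action(policy, obs, noises):
    """Deterministic denoising rollout given pre-sampled noises.

    noises: list of tensors, one per reverse step, with noises[-1]
    standing in for the initial draw x_T ~ N(0, I). Because no fresh
    randomness is drawn inside the loop, the rollout is deterministic
    given (obs, noises) and gradients can flow through all timesteps.
    """
    x = noises[-1]
    for t in reversed(range(len(noises) - 1)):
        # Reuse the pre-sampled noise for step t instead of a fresh draw.
        x = policy.denoise_step(x, obs, t, noise=noises[t])
    return x
```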
arXiv Detail & Related papers (2025-05-15T16:33:44Z)
- Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification [75.09791002021947]
Existing purification methods aim to disrupt adversarial perturbations by introducing a certain amount of noise through a forward diffusion process, followed by a reverse process to recover clean examples. This approach is fundamentally flawed, as the uniform operation of the forward process compromises normal pixels while attempting to combat adversarial perturbations. We propose a heterogeneous purification strategy grounded in the interpretability of neural networks. Our method applies higher-intensity noise to the specific pixels that the target model focuses on, while the remaining pixels are subjected to only low-intensity noise.
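The described mechanism maps directly onto a per-pixel noise scale; a minimal sketch, assuming a saliency map in [0, 1] (e.g. from Grad-CAM) and illustrative noise levels:

```python
import torch

def heterogeneous_forward_noise(x, saliency, sigma_hi=0.5, sigma_lo=0.1):
    """Inject strong noise where the target model attends and mild noise
    elsewhere, instead of one uniform forward diffusion step.

    saliency: per-pixel attention map in [0, 1], same shape as x.
    The two sigma levels are illustrative, not the paper's values.
    """
    sigma = sigma_lo + (sigma_hi - sigma_lo) * saliency  # per-pixel scale
    return x + sigma * torch.randn_like(x)
```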
arXiv Detail & Related papers (2025-03-03T11:00:25Z)
- Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies. DPPO achieves the strongest overall performance and efficiency for fine-tuning on common benchmarks. We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z)
- ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
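ROPO's exact objective isn't given in the summary; the sketch below shows one way a noise-tolerant preference loss without external models could look, down-weighting pairs by the model's own confidence in the label. The weighting rule is our assumption.

```python
import torch
import torch.nn.functional as F

def noise_tolerant_preference_loss(margin, alpha=2.0):
    """Noise-tolerant variant of the DPO loss -log sigmoid(margin).

    margin: beta-scaled implicit reward margin per preference pair.
    Pairs the current model finds implausible (large negative margin)
    are the likeliest label flips, so they receive low weight; the
    weight is detached so it acts as filtering, not a gradient path.
    """
    weight = torch.sigmoid(alpha * margin).detach()
    return (weight * -F.logsigmoid(margin)).mean()
```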
arXiv Detail & Related papers (2024-04-05T13:58:51Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
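Diffusion-QL pairs a diffusion behavior-cloning term with Q-value maximization; a minimal sketch under assumed `policy.bc_loss` and `policy.sample` interfaces:

```python
import torch

def diffusion_ql_loss(policy, critic, obs, dataset_actions, eta=1.0):
    """Behavior cloning keeps the policy on the dataset manifold; the
    Q term pushes sampled actions toward high value. `eta` trades off
    the two, and `policy.sample` must be differentiable.
    """
    bc = policy.bc_loss(obs, dataset_actions)  # denoising score-matching loss
    actions = policy.sample(obs)               # actions via reverse diffusion
    q = critic(obs, actions).mean()
    return bc - eta * q                        # minimize BC, maximize Q
```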
arXiv Detail & Related papers (2022-08-12T09:54:11Z)