Related papers: Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning

Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning

URL: http://arxiv.org/abs/2510.15388v1
Date: Fri, 17 Oct 2025 07:43:51 GMT
Title: Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
Authors: Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang,
Abstract summary: We introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme aligns it with the variational Jordan-Kinderlehrer-Otto principle from optimal transport.<n>SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions.<n>This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages.
Score: 56.47948583452555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While behavior cloning with flow/diffusion policies excels at learning complex skills from demonstrations, it remains vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to their iterative inference process and the limitations of existing workarounds. In this work, we introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme inherently aligns it with the variational Jordan-Kinderlehrer-Otto (JKO) principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. Each step corresponds to a JKO update, regularizing policy changes to stay near the previous iterate and ensuring stable online adaptation with entropic regularization. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages: simpler/faster training of sub-models, reduced computational/memory costs, and provable stability grounded in Wasserstein trust regions. Comprehensive experiments demonstrate SWFP's enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks.

Related papers

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training [18.849117699859622]
Training stability is a central challenge in reinforcement learning for large language models.<n>We propose Variational sEquence-level Soft Policy Optimization (VESPO)<n> Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution.
arXiv Detail & Related papers (2026-02-11T09:48:08Z)
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates.<n>SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence.<n> Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models [7.316631310935769]
Vision-Language-Action (VLA) models have shown strong generalization by leveraging large-scale demonstrations.<n>We propose Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective.<n>We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL.
arXiv Detail & Related papers (2025-10-11T03:11:18Z)
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling [9.936731043466699]
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process.<n>We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs.<n>We develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies.
arXiv Detail & Related papers (2025-09-30T04:21:20Z)
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to large, action spaces common in many real-world problems.<n>This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z)
One-Step Flow Policy Mirror Descent [52.31612487608593]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during flow policy inference.<n>Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
Generalized Incremental Learning under Concept Drift across Evolving Data Streams [32.62505920071586]
Real-world data streams exhibit inherent non-stationarity characterized by concept drift, posing significant challenges for adaptive learning systems.<n>We formalize Generalized Incremental Learning under Concept Drift (GILCD), characterizing the joint evolution of distributions and label spaces in open-environment streaming contexts.<n>We propose Calibrated Source-Free Adaptation (CSFA), which fuses emerging prototypes with base representations, enabling stable new-class identification.
arXiv Detail & Related papers (2025-06-06T04:36:24Z)
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.<n>Current approaches typically address this issue through online sampling from the target policy.<n>We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z)
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models.<n>By introducing an online rewardweighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold.<n>Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.