SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
- URL: http://arxiv.org/abs/2509.25756v2
- Date: Sun, 26 Oct 2025 04:37:16 GMT
- Title: SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
- Authors: Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding
- Abstract summary: Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. We develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies.
- Score: 9.936731043466699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
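The equivalence the abstract mentions between a flow rollout and a residual recurrent computation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `velocity`, `gate`, and `candidate` are hypothetical hand-written stand-ins for learned networks, and the gated update is only a GRU-style guess at what a "gated velocity" could look like, not the paper's actual Flow-G parameterization.

```python
import math


def velocity(state, action, t):
    # Hypothetical stand-in for a learned velocity network v(s, a, t).
    return [math.tanh(s) - a for s, a in zip(state, action)]


def candidate(state, action, t):
    # Hypothetical candidate action proposed at each step (assumption).
    return [math.tanh(s) for s in state]


def gate(state, action, t):
    # Hypothetical sigmoid gate with values in (0, 1) (assumption).
    return [1.0 / (1.0 + math.exp(-(s + a))) for s, a in zip(state, action)]


def euler_rollout(state, noise, num_steps=4):
    """Fixed-step Euler sampling: a_{k+1} = a_k + dt * v(s, a_k, t_k).

    Each step adds a residual to the previous action, so the loop has the
    same shape as a residual RNN unrolled for num_steps steps.
    """
    action = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(state, action, k * dt)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def gated_rollout(state, noise, num_steps=4):
    """GRU-style variant: a_{k+1} = a_k + dt * g * (c - a_k).

    Because dt * g lies in (0, 1), each step is a convex combination of the
    current action and the candidate, which keeps the update bounded.
    """
    action = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        g = gate(state, action, t)
        c = candidate(state, action, t)
        action = [a + dt * gi * (ci - a) for a, gi, ci in zip(action, g, c)]
    return action
```

Calling `euler_rollout([0.0], [1.0])` and `gated_rollout([0.0], [1.0])` runs both recurrences from the same noise sample. In the gated version the per-step change is scaled by a factor strictly between 0 and 1, which is the kind of mechanism gated sequence models use to temper exploding or vanishing gradients.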
Related papers
- Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation [65.13627721310613]
Mean velocity policy (MVP) is a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench.
arXiv Detail & Related papers (2026-02-14T14:44:06Z)
- Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics [49.242224984144904]
We propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled algorithm that explicitly incorporates the gradient of a Process Reward Model. We derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model.
arXiv Detail & Related papers (2026-02-04T08:59:57Z)
- One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow [56.13949180229929]
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow. Our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
arXiv Detail & Related papers (2025-11-17T06:34:17Z)
- Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning [56.47948583452555]
We introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme aligns it with the variational Jordan-Kinderlehrer-Otto principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages.
arXiv Detail & Related papers (2025-10-17T07:43:51Z)
- One-Step Flow Policy Mirror Descent [38.39095131927252]
Flow Policy Mirror Descent (FPMD) is an online RL algorithm that enables 1-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight flow matching models.
arXiv Detail & Related papers (2025-07-31T15:51:10Z)
- Relative Entropy Pathwise Policy Optimization [66.03329137921949]
We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories. We show how to combine policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning.
arXiv Detail & Related papers (2025-07-15T06:24:07Z)
- ContinualFlow: Learning and Unlearning with Neural Flow Matching [13.628458744188325]
We introduce ContinualFlow, a principled framework for targeted unlearning in generative models via Flow Matching. Our method leverages an energy-based reweighting loss to softly subtract undesired regions of the data distribution without retraining from scratch or requiring direct access to the samples to be unlearned.
arXiv Detail & Related papers (2025-06-23T15:20:58Z)
- Flow-Based Policy for Online Reinforcement Learning [34.86742824686496]
FlowRL is a framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We show that FlowRL achieves competitive performance in online reinforcement learning benchmarks.
arXiv Detail & Related papers (2025-06-15T10:53:35Z)
- Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z)
- Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for Deep-Learned Control Systems [11.93664682521114]
We focus our attention on feedforward neural networks with the rectified linear unit (ReLU) activation. We provide an accelerated algorithm for computing ROAs that leverages the incremental and connected enumeration of affine regions. Finally, we apply our methods to find a set of states that are stabilized by an image-based controller for an aircraft runway control problem.
arXiv Detail & Related papers (2022-10-15T17:15:53Z)
- Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics [96.9177297872723]
We present a novel method for guaranteeing linear momentum in learned physics simulations.
We enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers.
In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially.
arXiv Detail & Related papers (2022-10-12T09:12:59Z)
- Deep Equilibrium Optical Flow Estimation [80.80992684796566]
Recent state-of-the-art (SOTA) optical flow models use finite-step recurrent update operations to emulate traditional algorithms.
These RNNs impose large computation and memory overheads, and are not directly trained to model such stable estimation.
We propose deep equilibrium (DEQ) flow estimators, an approach that directly solves for the flow as the infinite-level fixed point of an implicit layer.
arXiv Detail & Related papers (2022-04-18T17:53:44Z)
- Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.