Related papers: What Does Flow Matching Bring To TD Learning?

URL: http://arxiv.org/abs/2603.04333v1
Date: Wed, 04 Mar 2026 17:51:30 GMT
Title: What Does Flow Matching Bring To TD Learning?
Authors: Bhavya Agrawalla, Michal Nauman, Aviral Kumar
Abstract summary: Flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL). We show that its success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. We argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms.
Score: 28.717975688380488
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that its success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2× in final performance and around 5× in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.
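
The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a linear interpolant z_t = (1 - t) * z0 + t * target toward a single fixed scalar target, for which the velocity field has the closed form (target - z) / (1 - t). All function names and parameters below are hypothetical and chosen only for illustration.

```python
import numpy as np


def ideal_velocity(z, t, target):
    # Assumption: linear interpolant toward one fixed scalar target.
    # Along z_t = (1 - t) * z0 + t * target the velocity is target - z0,
    # which equals (target - z_t) / (1 - t) for t < 1.
    return (target - z) / (1.0 - t)


def read_out_value(z0, target, num_steps, perturb_at=None, perturb_by=0.0):
    # Read out a value by Euler integration of the velocity field from t=0 to t=1.
    # Optionally inject an error at one intermediate step to mimic a bad
    # early value estimate.
    z = float(z0)
    for k in range(num_steps):
        t = k / num_steps
        if k == perturb_at:
            z += perturb_by
        z += ideal_velocity(z, t, target) / num_steps
    return z


def velocity_loss(velocity_fn, z0, target, ts):
    # Dense velocity supervision: regress the predicted velocity onto
    # (target - z0) at several interpolant times ts, not only at the endpoint.
    z_t = (1.0 - ts) * z0 + ts * target
    return float(np.mean((velocity_fn(z_t, ts) - (target - z0)) ** 2))


if __name__ == "__main__":
    clean = read_out_value(z0=0.0, target=3.0, num_steps=8)
    perturbed = read_out_value(z0=0.0, target=3.0, num_steps=8,
                               perturb_at=2, perturb_by=5.0)
    print("clean read-out:", clean)
    print("read-out after an injected early error:", perturbed)
    ts = np.array([0.0, 0.25, 0.5, 0.75])
    print("velocity loss of the ideal field:",
          velocity_loss(lambda z, t: ideal_velocity(z, t, 3.0), 0.0, 3.0, ts))
```

In this idealized setting, later integration steps pull the estimate back toward the target even after an error is injected early, which is the intuition behind test-time recovery. A learned critic would replace `ideal_velocity` with a network conditioned on state and action, and the target would be a TD target rather than a constant.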

Related papers

Advancing Analytic Class-Incremental Learning through Vision-Language Calibration [6.871141687303144]
Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. We propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning.
arXiv Detail & Related papers (2026-02-14T08:32:51Z)
FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories [82.90132015584359]
ReFlow has theoretical consistency with flow matching but suboptimal performance in practical scenarios. We propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories.
arXiv Detail & Related papers (2025-11-24T07:13:23Z)
Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules [9.332823269318842]
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary learning rate schedules (LRSs). We derive explicit scaling relations in both data- and compute-limited regimes.
arXiv Detail & Related papers (2025-09-23T16:05:16Z)
floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL [26.288205235851887]
floq is an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching. floq improves performance by nearly 1.8x across a suite of challenging offline RL benchmarks and online fine-tuning tasks.
arXiv Detail & Related papers (2025-09-08T16:31:09Z)
Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of the Q-network during training. For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL). Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification [18.19207291891767]
Key considerations include training targets, activation functions, and loss functions. We study a range of loss functions when speaker identity is used as the training target. We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid.
arXiv Detail & Related papers (2022-01-17T14:32:51Z)
Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale. We show that this phenomenon exists, and then propose a first-order correction term to momentum. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z)
Robust Learning via Persistency of Excitation [4.674053902991301]
We show that network training using gradient descent is equivalent to a dynamical system parameter estimation problem. We provide an efficient technique for estimating the corresponding Lipschitz constant using extreme value theory. Our approach also universally increases the adversarial accuracy by 0.1 to 0.3 percentage points in various state-of-the-art adversarially trained models.
arXiv Detail & Related papers (2021-06-03T18:49:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.