FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning
- URL: http://arxiv.org/abs/2510.22686v1
- Date: Sun, 26 Oct 2025 14:12:32 GMT
- Title: FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning
- Authors: Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, Bei Peng
- Abstract summary: Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL. Inspired by flow matching's success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic.
- Score: 8.193127364294034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing convergence speed and final performance. Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL, yet the former merely combines multiple point estimates without capturing distributional information, whereas the latter relies on discretization or quantile regression, limiting the expressiveness of complex value distributions. Inspired by flow matching's success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic. Departing from conventional regression for deterministic value prediction, FlowCritic leverages flow matching to model value distributions and generate samples for value estimation.
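The abstract frames value estimation as sample generation: a network learns a velocity field that transports noise to return samples conditioned on a state-action pair, and the value estimate is the mean of the generated samples. The sketch below illustrates this idea with standard conditional flow matching; `VelocityNet`, the straight-line interpolation path, the TD-style `target_return`, and the Euler sampler are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a flow-matching critic (hypothetical; the paper's exact
# architecture, conditioning, and training targets are not given in the abstract).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity of a 1-D return sample conditioned on (s, a, t)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_t, t, state, action):
        return self.net(torch.cat([x_t, t, state, action], dim=-1))

def flow_matching_loss(model, state, action, target_return):
    """Conditional flow matching: regress the constant velocity along the
    straight path from a noise sample x0 to the return target x1."""
    x1 = target_return                 # (B, 1) bootstrapped return target
    x0 = torch.randn_like(x1)          # noise endpoint of the path
    t = torch.rand_like(x1)            # random time along the path
    x_t = (1 - t) * x0 + t * x1        # linear interpolant
    v_target = x1 - x0                 # velocity of the straight-line path
    v_pred = model(x_t, t, state, action)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_values(model, state, action, n_samples=32, steps=20):
    """Generate return samples by Euler-integrating the learned flow;
    their mean serves as the value estimate."""
    B = state.shape[0]
    s = state.repeat_interleave(n_samples, dim=0)
    a = action.repeat_interleave(n_samples, dim=0)
    x = torch.randn(B * n_samples, 1, device=state.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full_like(x, i * dt)
        x = x + dt * model(x, t, s, a)
    return x.view(B, n_samples).mean(dim=1)
```

In an actor-critic loop, `target_return` would typically be a bootstrapped target such as r + γ·V(s'), and `sample_values` would stand in for the scalar critic output; the empirical spread of the generated samples also gives a distributional signal that a point-estimate critic cannot.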
Related papers
- DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training [94.568675548967]
Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain generalization. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. We propose DFPO, a robust distributional RL framework that models values as continuous flows across time steps.
arXiv Detail & Related papers (2026-02-05T17:07:42Z)
- Distributional Evaluation of Generative Models via Relative Density Ratio [12.663086000741872]
We propose a functional evaluation metric for generative models based on the relative density ratio (RDR). We show that the RDR, as a functional summary of the goodness-of-fit of the generative model, possesses several desirable theoretical properties. We show that the estimated RDR not only allows for an effective comparison of the overall performance of competing generative models, but can also offer a convenient means of revealing the nature of the underlying goodness-of-fit.
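The summary does not spell out the definition; for context, a common form of the relative density ratio in the density-ratio estimation literature is given below (with data density p, model density q, and mixing weight α). The paper's exact formulation may differ.

```latex
% Relative density ratio with mixture weight alpha (standard form from the
% density-ratio estimation literature; the paper's exact definition may differ):
\[
  r_\alpha(x) \;=\; \frac{p(x)}{\alpha\, p(x) + (1 - \alpha)\, q(x)},
  \qquad \alpha \in [0, 1).
\]
% For alpha > 0 the ratio is bounded above by 1/alpha, which makes its
% estimation more stable than that of the plain ratio p(x)/q(x).
```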
arXiv Detail & Related papers (2025-10-29T13:31:35Z)
- Value Flows [90.1510269525399]
This paper uses modern, flexible flow-based models to estimate full future return distributions. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. Experiments on 37 state-based and 25 image-based benchmark tasks demonstrate that Value Flows achieves a 1.3× improvement on average in success rates.
arXiv Detail & Related papers (2025-10-09T00:57:40Z)
- Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
- Entropy-regularized Gradient Estimators for Approximate Bayesian Inference [2.44755919161855]
This paper addresses the estimation of the Bayesian posterior to generate diverse samples by approximating the gradient flow of the Kullback-Leibler divergence. It presents empirical evaluations on classification tasks to assess the method's performance and discusses its effectiveness for model-based reinforcement learning.
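As background for "approximating the gradient flow of the Kullback-Leibler divergence" (a standard result, not specific to this paper), the Wasserstein gradient flow of KL(ρ_t ∥ π) toward a target posterior π takes the form below.

```latex
% Wasserstein gradient flow of KL(rho_t || pi) toward the posterior pi
% (standard background, not specific to this paper):
\[
  \partial_t \rho_t \;=\; \nabla \cdot \Bigl( \rho_t \, \nabla \log \frac{\rho_t}{\pi} \Bigr),
\]
% realized by particles moving with velocity
% v(x) = \nabla \log \pi(x) - \nabla \log \rho_t(x);
% the entropy term \nabla \log \rho_t must be estimated from samples,
% which is where gradient estimators of this kind enter.
```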
arXiv Detail & Related papers (2025-03-15T02:30:46Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
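For reference, quantile regression in distributional RL typically minimizes the pinball loss below; EQR's precise objective is not given in this summary, so the notation is illustrative.

```latex
% Pinball (quantile-regression) loss at quantile level tau in (0, 1);
% notation is illustrative and may differ from EQR's exact objective:
\[
  \rho_\tau(u) \;=\; u \,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
  \qquad u = z - Z_\theta(s, a; \tau),
\]
% where z is a sampled return target and Z_theta(s, a; tau) the predicted
% tau-quantile; minimizing the expected loss drives Z_theta toward the
% tau-quantile of the return distribution.
```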
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- Normality-Guided Distributional Reinforcement Learning for Continuous Control [13.818149654692863]
Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We propose a policy update strategy based on correctness as measured by structural characteristics of the value distribution that are not present in the standard value function.
arXiv Detail & Related papers (2022-08-28T02:52:10Z)
- Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z)
- Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction [37.06232589005015]
The value function is the central notion of reinforcement learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
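One way to write the decomposition described above (notation is illustrative; VDFP's exact formulation may differ): encode the future trajectory into a latent variable z, then approximate the value as an expectation over predicted latents.

```latex
% Illustrative form of the decomposition (notation is assumed, not taken
% from the paper): tau is the future trajectory, z its latent encoding.
\[
  V^{\pi}(s) \;=\; \mathbb{E}_{\tau \sim \pi}\bigl[ R(\tau) \mid s \bigr]
  \;\approx\; \mathbb{E}_{z \sim m_\phi(\cdot \mid s)}\bigl[ g_\psi(z) \bigr],
\]
% where m_phi models the latent future dynamics given the state, and
% g_psi maps a latent future to its trajectory return independently of
% the policy.
```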
arXiv Detail & Related papers (2021-03-03T07:28:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here (including all generated summaries) and is not responsible for any consequences of its use.