Distributional value gradients for stochastic environments
- URL: http://arxiv.org/abs/2601.20071v2
- Date: Fri, 30 Jan 2026 10:15:42 GMT
- Title: Distributional value gradients for stochastic environments
- Authors: Baptiste Debes, Tinne Tuytelaars
- Abstract summary: Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. In this work, we address the limitations of existing approaches in stochastic or noisy environments by extending distributional reinforcement learning on continuous state-action spaces. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE).
- Score: 37.5115685757579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.
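The distributional Bellman operator is instantiated sample-to-sample via Max-sliced Maximum Mean Discrepancy. As background, here is a minimal MSMMD sketch, assuming a Gaussian kernel on one-dimensional projections and gradient ascent over the slicing direction; the names (msmmd, bandwidth, n_steps) are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of Max-sliced MMD between two sample sets, assuming a
# Gaussian kernel on 1-D projections; not the paper's actual implementation.
import torch

def mmd2_1d(x, y, bandwidth=1.0):
    """Biased squared MMD between 1-D samples x, y with a Gaussian kernel."""
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def msmmd(X, Y, n_steps=200, lr=0.1, bandwidth=1.0):
    """Max-sliced MMD: maximize the 1-D MMD over unit projection directions."""
    d = X.shape[1]
    w = torch.randn(d, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(n_steps):
        u = w / w.norm()                              # project onto the unit sphere
        loss = -mmd2_1d(X @ u, Y @ u, bandwidth)      # ascend on MMD^2
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        u = w / w.norm()
        return mmd2_1d(X @ u, Y @ u, bandwidth).clamp_min(0).sqrt()

# Toy usage: two Gaussians that differ only along one coordinate.
X = torch.randn(256, 8)
Y = torch.randn(256, 8); Y[:, 0] += 1.0
print(float(msmmd(X, Y)))
```

Optimizing a single slicing direction, rather than averaging over random projections, is what makes the discrepancy "max-sliced": the kernel computation stays one-dimensional while targeting the most discriminative direction.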
Related papers
- EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models [42.41157160976886]
We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model. Existing approaches either replace discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. We introduce EntRGi: Entropy Aware Reward Guidance that dynamically regulates the gradients from the reward model.
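The abstract does not spell out the regulation rule, so the following is only a speculative sketch of the mechanism: per-position reward-model gradients are gated by the normalized entropy of the predicted token distribution before being applied to the logits. All names here are hypothetical.

```python
# Speculative sketch of entropy-aware reward guidance for one denoising step;
# the gating rule below is an illustrative guess, not EntRGi's actual method.
import torch
import torch.nn.functional as F

def entropy_gated_guidance(logits, reward_grad, scale=1.0):
    """Scale per-position reward gradients by the normalized token entropy."""
    probs = F.softmax(logits, dim=-1)                        # (seq, vocab)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)    # (seq,)
    gate = ent / torch.log(torch.tensor(float(logits.shape[-1])))  # in [0, 1]
    return logits + scale * gate[:, None] * reward_grad      # guided logits

# Toy usage with random logits and a random reward gradient.
logits = torch.randn(16, 100)
reward_grad = torch.randn(16, 100)
guided = entropy_gated_guidance(logits, reward_grad)
```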
arXiv Detail & Related papers (2026-02-04T19:37:14Z)
- Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics [49.242224984144904]
We propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled algorithm that explicitly incorporates the gradient of a Process Reward Model. We derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model.
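As a hedged illustration of reward-gradient-guided dynamics, the sketch below adds the gradient of a reward to an Euler step of a flow-matching ODE; Euphonium's actual formulation (process rewards, stochastic dynamics, distillation) is richer, and guidance_scale and the additive form are assumptions.

```python
# Hedged sketch of a reward-gradient-guided Euler step for flow matching;
# the additive guidance form and guidance_scale are illustrative assumptions.
import torch

def guided_euler_step(x, t, dt, velocity_fn, reward_fn, guidance_scale=1.0):
    """One Euler step of dx = (v(x, t) + s * grad reward(x)) dt."""
    x = x.detach().requires_grad_(True)
    r = reward_fn(x).sum()
    (grad,) = torch.autograd.grad(r, x)
    with torch.no_grad():
        return x + dt * (velocity_fn(x, t) + guidance_scale * grad)

# Toy usage: a zero velocity field and a reward peaked at the origin.
x = torch.randn(4, 2)
for step in range(10):
    x = guided_euler_step(x, t=step / 10, dt=0.1,
                          velocity_fn=lambda x, t: torch.zeros_like(x),
                          reward_fn=lambda x: -(x ** 2).sum(-1))
```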
arXiv Detail & Related papers (2026-02-04T08:59:57Z)
- Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
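For readers unfamiliar with particle Gibbs, the sketch below shows the core conditional-SMC mechanics on a toy Gaussian state-space model: one particle is pinned to the reference trajectory, and a full trajectory is resampled at the end of each sweep. PG-DLM adapts this idea to denoising trajectories of a diffusion language model, which involves far more machinery than shown here.

```python
# Generic particle Gibbs (conditional SMC) sketch on a toy random-walk model;
# purely background, not PG-DLM's actual algorithm.
import numpy as np

rng = np.random.default_rng(0)

def csmc_sweep(ref, T, n_particles, log_reward):
    """One conditional-SMC sweep: the last particle is pinned to the reference path."""
    X = np.zeros((T, n_particles))
    A = np.zeros((T, n_particles), dtype=int)            # ancestor indices
    X[0] = rng.normal(size=n_particles); X[0, -1] = ref[0]
    logw = log_reward(0, X[0])
    for t in range(1, T):
        w = np.exp(logw - logw.max()); w /= w.sum()
        A[t] = rng.choice(n_particles, size=n_particles, p=w)
        A[t, -1] = n_particles - 1                       # keep the reference lineage
        X[t] = X[t - 1, A[t]] + rng.normal(size=n_particles)  # random-walk prior
        X[t, -1] = ref[t]
        logw = log_reward(t, X[t])
    # Sample one full trajectory by weight as the new reference.
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(n_particles, p=w)
    path = np.zeros(T); path[-1] = X[-1, k]
    for t in range(T - 1, 0, -1):
        k = A[t, k]; path[t - 1] = X[t - 1, k]
    return path

# Toy usage: steer a random walk toward states near 2.0.
log_r = lambda t, x: -(x - 2.0) ** 2
ref = np.zeros(8)
for _ in range(5):                                       # Gibbs iterations refine the path
    ref = csmc_sweep(ref, T=8, n_particles=32, log_reward=log_r)
```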
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
- Conditioning Diffusions Using Malliavin Calculus [18.62300657866048]
In generative modelling and optimal control, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. We introduce a novel framework based on Malliavin calculus and centred around a generalisation of the Tweedie score formula to nonlinear differential equations. This allows our approach to handle a broad range of applications, like diffusion bridges, or adding conditional controls to an already trained diffusion model.
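For background, the classical Tweedie score formula that the paper generalises, stated for a Gaussian perturbation kernel (a standard identity, not taken from the paper):

```latex
% For p_t(x_t \mid x_0) = \mathcal{N}(x_t; x_0, \sigma_t^2 I):
\nabla_{x_t} \log p_t(x_t) = \frac{\mathbb{E}[x_0 \mid x_t] - x_t}{\sigma_t^2}
```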
arXiv Detail & Related papers (2025-04-04T14:10:21Z)
- TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression [11.040033344386366]
We propose a two-step method with a novel fused-regularizer to improve the learning performance on a target task with limited samples.
A nonasymptotic bound is provided for the estimation error of the target model.
We extend the method to a distributed setting, allowing for a pretraining-finetuning strategy.
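The precise fused regularizer is the paper's contribution; as a generic stand-in, the sketch below fits a target regression while shrinking its coefficients toward a source model through an l1 fusion penalty. The function fused_lasso_gd and the penalty form are illustrative assumptions.

```python
# Illustrative two-step fused-regularized regression: the exact penalty in
# TransFusion differs; here a generic l1 fusion ||w - w_src||_1 shrinks the
# target model toward a source model fit on abundant data.
import numpy as np

def fused_lasso_gd(X, y, w_src, lam=0.1, lr=1e-2, n_steps=2000):
    """Minimize ||y - Xw||^2 / (2n) + lam * ||w - w_src||_1 by subgradient descent."""
    n, d = X.shape
    w = w_src.copy()
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n + lam * np.sign(w - w_src)
        w -= lr * grad
    return w

# Toy usage: the target task shares most coefficients with the source.
rng = np.random.default_rng(0)
w_src = rng.normal(size=20)
w_true = w_src.copy(); w_true[:3] += 0.5           # small task shift
X = rng.normal(size=(40, 20))                      # limited target samples
y = X @ w_true + 0.1 * rng.normal(size=40)
w_hat = fused_lasso_gd(X, y, w_src)
```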
arXiv Detail & Related papers (2024-04-01T14:58:16Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
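EQR builds on quantile regression of values. As background, here is the standard pinball-loss ingredient for learning quantiles of a return distribution; this is not EQR's full model-based algorithm.

```python
# Minimal quantile-regression (pinball) loss as used in distributional RL;
# background only, not EQR's epistemic, model-based variant.
import torch

def quantile_loss(pred_quantiles, target_samples, taus):
    """Pinball loss between predicted quantiles (batch, n_q) and target samples (batch, n_t)."""
    # Pairwise errors: each target sample minus each predicted quantile.
    u = target_samples[:, None, :] - pred_quantiles[:, :, None]   # (batch, n_q, n_t)
    loss = torch.where(u >= 0, taus[None, :, None] * u, (taus[None, :, None] - 1) * u)
    return loss.mean()

# Toy usage: fit 5 quantiles of a fixed return distribution N(1, 4).
taus = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9])
q = torch.zeros(1, 5, requires_grad=True)
opt = torch.optim.SGD([q], lr=0.05)
for _ in range(500):
    target = torch.randn(1, 64) * 2 + 1
    opt.zero_grad(); quantile_loss(q, target, taus).backward(); opt.step()
print(q.detach())
```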
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
- GMAC: A Distributional Perspective on Actor-Critic Framework [6.243642831536256]
We propose a new method that minimizes the Cramér distance with the multi-step Bellman target distribution generated from a novel Sample-Replacement algorithm SR($\lambda$).
We empirically show that GMAC captures the correct representation of value distributions and improves the performance of a conventional actor-critic method with low computational cost.
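In one dimension the squared Cramér distance has a simple sample-based form via the energy-distance identity; a background sketch follows (not GMAC's SR($\lambda$) pipeline).

```python
# Sample-based squared Cramer distance between 1-D return distributions, via
# l2^2(F, G) = E|X-Y| - E|X-X'|/2 - E|Y-Y'|/2; background only.
import torch

def cramer2(x, y):
    """V-statistic estimate of the squared Cramer distance between 1-D samples."""
    dxy = (x[:, None] - y[None, :]).abs().mean()
    dxx = (x[:, None] - x[None, :]).abs().mean()
    dyy = (y[:, None] - y[None, :]).abs().mean()
    return dxy - 0.5 * dxx - 0.5 * dyy

# Toy usage: the distance grows with the mean gap between distributions.
x = torch.randn(512)
print(float(cramer2(x, torch.randn(512))), float(cramer2(x, torch.randn(512) + 1.0)))
```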
arXiv Detail & Related papers (2021-05-24T15:50:26Z)
- Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity [26.518803984578867]
Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging.
One typically resorts to sampling-based approximations of the true marginal.
We propose a new training strategy which replaces these estimators by an exact yet efficient marginalization.
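One way to see the idea: if the latent posterior is computed with a sparse mapping such as sparsemax, the expectation over latent values has only a few nonzero terms and can be summed exactly instead of sampled. The sketch below is illustrative; the paper also handles structured latents.

```python
# Sketch of exact marginalization over a discrete latent via a sparse
# posterior: sparsemax puts nonzero mass on few latent values, so the
# expectation is an exact sum over a small support. (Illustrative only.)
import torch

def sparsemax(z):
    """Project logits z onto the probability simplex (Martins & Astudillo, 2016)."""
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    cssv = z_sorted.cumsum(0) - 1
    support = z_sorted - cssv / k > 0
    tau = cssv[support][-1] / support.sum()
    return torch.clamp(z - tau, min=0)

def exact_marginal_loss(logits, loss_per_latent):
    """E_{z ~ sparsemax(logits)}[loss(z)], summed only over the sparse support."""
    p = sparsemax(logits)
    support = p > 0
    return (p[support] * loss_per_latent[support]).sum()

# Toy usage: 10 latent values, but sparsemax keeps only a few of them.
logits = torch.tensor([2.0, 1.9, 0.1, -1.0, -2.0, -3.0, -3.5, -4.0, -5.0, -6.0])
losses = torch.rand(10)
print(sparsemax(logits), float(exact_marginal_loss(logits, losses)))
```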
arXiv Detail & Related papers (2020-07-03T19:36:35Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)