Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
- URL: http://arxiv.org/abs/2508.21443v1
- Date: Fri, 29 Aug 2025 09:12:41 GMT
- Title: Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
- Authors: Xinyi Sheng, Dominik Baumann
- Abstract summary: We propose a novel reinforcement learning algorithm that combines the standard ensemble average with the time-average growth rate. We evaluate our algorithm in challenging simulations, where it outperforms conventional RL methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) algorithms typically optimize the expected cumulative reward, i.e., the expected value of the sum of scalar rewards an agent receives over the course of a trajectory. The expected value averages the performance over an infinite number of trajectories. However, when deploying the agent in the real world, this ensemble average may be uninformative about the performance of individual trajectories. Thus, in many applications, optimizing the long-term performance of individual trajectories might be more desirable. In this work, we propose a novel RL algorithm that combines the standard ensemble average with the time-average growth rate, a measure of the long-term performance of individual trajectories. We first define the Bellman operator for the time-average growth rate. We then show that, under multiplicative reward dynamics, the geometric mean aligns with the time-average growth rate. To address more general and unknown reward dynamics, we propose a modified geometric mean with an $N$-sliding window that captures the path dependency as an estimator for the time-average growth rate. This estimator is embedded as a regularizer into the objective, forming a practical algorithm and enabling the policy to benefit from the ensemble average and the time average simultaneously. We evaluate our algorithm in challenging simulations, where it outperforms conventional RL methods.
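To make the proposed estimator concrete, below is a minimal sketch of a sliding-window geometric-mean growth estimate used as a regularizer, assuming per-step rewards act as positive multiplicative growth factors. The function names, the window size, the floor `eps`, and the weight `lam` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sliding_geometric_mean_growth(rewards, window=8, eps=1e-8):
    """Sketch: estimate a trajectory's time-average growth rate with an
    N-step sliding window over log growth factors (rewards assumed positive).
    The paper's exact estimator may differ."""
    logs = np.log(np.maximum(np.asarray(rewards, dtype=np.float64), eps))
    kernel = np.ones(window) / window
    window_means = np.convolve(logs, kernel, mode="valid")  # log geo-mean per window
    return float(window_means.mean())  # average log growth per step

def regularized_objective(expected_return, trajectory_rewards, lam=0.1, window=8):
    """Ensemble average plus a time-average-growth regularizer (hedged form)."""
    return expected_return + lam * sliding_geometric_mean_growth(trajectory_rewards, window)
```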
Related papers
- Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts [3.5577285720638194]
Reinforcement Learning (RL) remains a central optimisation framework in machine learning. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. In non-ergodic environments, the ensemble average diverges from the time-average growth experienced by individual agents.
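A standard toy illustration of this divergence (a classic example, not taken from the paper) is the multiplicative coin toss: the ensemble average grows while almost every individual trajectory decays.

```python
import numpy as np

rng = np.random.default_rng(0)
# Wealth is multiplied by 1.5 on heads and 0.6 on tails.
# Ensemble average per step: 0.5*1.5 + 0.5*0.6 = 1.05 (growth), but the
# time-average growth rate is 0.5*log(1.5) + 0.5*log(0.6) ~ -0.053 (decay),
# so the process is non-ergodic: the mean is not the typical outcome.
factors = rng.choice([1.5, 0.6], size=(100_000, 20))
wealth = factors.prod(axis=1)
print(wealth.mean())      # ~1.05**20 ~ 2.65: the ensemble average grows
print(np.median(wealth))  # ~0.9**10 ~ 0.35: the typical trajectory shrinks
```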
arXiv Detail & Related papers (2026-01-13T16:53:40Z) - Geometric-Mean Policy Optimization [122.95205388291987]
We propose a stabilized variant of Group Relative Policy Optimization (GRPO), termed Geometric-Mean Policy Optimization (GMPO). Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark.
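A minimal sketch of the central quantity, assuming positive token-level rewards (the names and clamping are illustrative; GMPO's full objective also involves importance ratios and clipping):

```python
import torch

def geometric_mean(token_rewards, eps=1e-8):
    """Per-sequence geometric mean of positive token-level rewards, computed
    in log space for numerical stability; a single outlier token moves this
    far less than it moves the arithmetic mean."""
    logs = torch.log(token_rewards.clamp_min(eps))  # (batch, seq_len)
    return logs.mean(dim=-1).exp()
```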
arXiv Detail & Related papers (2025-07-28T09:54:05Z) - TreeRPO: Tree Relative Policy Optimization [55.97385410074841]
TreeRPO is a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling.
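A hedged sketch of the step-level group-relative idea: completions expanded from the same tree node form a group, and each one is scored against its siblings. The helper below is illustrative, not TreeRPO's code.

```python
import numpy as np

def step_group_advantages(sibling_rewards, eps=1e-8):
    """Standardize rewards within one step-level group (children sampled
    from the same prefix during tree sampling), GRPO-style."""
    r = np.asarray(sibling_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```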
arXiv Detail & Related papers (2025-06-05T15:56:38Z) - A Differential Perspective on Distributional Reinforcement Learning [7.028778922533688]
We extend distributional reinforcement learning to the average-reward setting, where an agent aims to optimize the reward received per time step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution.
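A tabular sketch of one quantile-TD step in the differential (average-reward) setting, where each quantile of the return distribution moves toward targets of the form r - rho + theta_next[j]. All names and step sizes here are assumptions; the paper's algorithms differ in detail.

```python
import numpy as np

def quantile_differential_update(theta, theta_next, r, rho, alpha=0.05, beta=0.01):
    """theta: quantile estimates of the differential return at (s, a);
    theta_next: quantile estimates at the next state-action; rho: running
    estimate of the per-step reward rate. Hedged illustrative sketch."""
    n = len(theta)
    taus = (np.arange(n) + 0.5) / n              # quantile midpoints
    targets = r - rho + theta_next               # one target per next quantile
    for i in range(n):
        # Quantile-regression subgradient, averaged over all targets.
        theta[i] += alpha * np.mean(taus[i] - (targets < theta[i]))
    rho += beta * (r + theta_next.mean() - theta.mean() - rho)  # differential TD error
    return theta, rho
```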
arXiv Detail & Related papers (2025-06-03T19:26:25Z) - When, Where and Why to Average Weights? [36.106114687828395]
Averaging checkpoints along the training trajectory is a powerful approach to improving the generalization performance of Machine Learning models. We show that averaging significantly accelerates training and yields considerable efficiency gains, at only a minimal implementation and memory cost.
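The operation being analyzed is simple to state; a minimal sketch, assuming checkpoints saved as plain `state_dict`s of floating-point tensors:

```python
import torch

def average_checkpoints(paths):
    """Uniformly average model weights saved along one training run."""
    states = [torch.load(p, map_location="cpu") for p in paths]
    return {k: torch.stack([s[k].float() for s in states]).mean(dim=0)
            for k in states[0]}

# Usage sketch: model.load_state_dict(average_checkpoints(ckpt_paths))
```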
arXiv Detail & Related papers (2025-02-10T18:40:48Z) - Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose, as a quality index, a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
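As a loose illustration of trading trajectory length against trajectory count, here is a Neyman-style split of an interaction budget under assumed per-length variance estimates; this is not RIDO's actual schedule, which adapts across mini-batches.

```python
import numpy as np

def sqrt_variance_allocation(variance_by_length, step_budget):
    """Spend more of the budget on truncation lengths whose return estimates
    are noisier, in proportion to the square root of estimated variance."""
    w = np.sqrt(np.asarray(variance_by_length, dtype=np.float64))
    return np.floor(step_budget * w / w.sum()).astype(int)
```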
arXiv Detail & Related papers (2024-10-17T11:47:56Z) - Reinforcement learning with non-ergodic reward increments: robustness via ergodicity transformations [8.44491527275706]
Application areas for reinforcement learning include autonomous driving, precision agriculture, and finance. However, the focus of RL is typically on the expected value of the return. We develop an algorithm that lets RL agents optimize the long-term performance of individual trajectories.
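For multiplicative dynamics the transformation is simply the logarithm: log-wealth increments are additive and ergodic, so their time average recovers the growth rate. A minimal sketch (the paper treats more general transformations):

```python
import numpy as np

def log_transformed_rewards(wealth_curve):
    """Turn a positive multiplicative wealth trajectory into additive
    rewards whose time average equals the log growth rate per step."""
    return np.diff(np.log(np.asarray(wealth_curve, dtype=np.float64)))
```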
arXiv Detail & Related papers (2023-10-17T15:13:33Z) - Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
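The intuition for why truncation can help under discounting: the tail of a discounted return is geometrically small, so steps beyond an effective horizon contribute almost nothing. A back-of-the-envelope sketch with bounded rewards (names assumed):

```python
import math

def effective_horizon(gamma, eps, r_max=1.0):
    """Smallest T with tail bound gamma**T * r_max / (1 - gamma) <= eps:
    truncating at T changes the discounted return by at most eps, freeing
    budget to collect more, shorter trajectories."""
    return math.ceil(math.log(eps * (1 - gamma) / r_max) / math.log(gamma))
```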
arXiv Detail & Related papers (2023-05-07T19:41:57Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
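The standard building block for learning from such preferences is a Bradley-Terry likelihood over trajectory returns; a hedged sketch of that component (the paper's contribution is an optimistic model-based algorithm built on top of such a model):

```python
import torch
import torch.nn.functional as F

def preference_loss(return_a, return_b, prefer_a):
    """Bradley-Terry link: P(A preferred over B) = sigmoid(R(A) - R(B)),
    fit by binary cross-entropy against human preference labels."""
    return F.binary_cross_entropy_with_logits(return_a - return_b, prefer_a.float())
```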
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Joint Optimization of Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm [50.50545326342971]
We formulate the problem of maximizing a non-linear concave function of multiple long-term objectives. A policy-gradient-based model-free algorithm is proposed for the problem. The proposed algorithm is shown to achieve convergence to within $\epsilon$ of the global optimum.
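By the chain rule, the gradient of such an objective is a weighted mix of per-objective policy gradients, $\nabla f(J) = \sum_k (\partial f / \partial J_k) \nabla J_k$, so autograd through $f$ of differentiable objective estimates yields the combined update. A minimal sketch with assumed inputs:

```python
import torch

def scalarized_loss(objective_estimates, f):
    """Negative concave f of differentiable long-term objective estimates;
    backprop recovers the weighted mix of per-objective policy gradients."""
    return -f(torch.stack(objective_estimates))

# e.g. proportional fairness: f = lambda J: torch.log(J.clamp_min(1e-8)).sum()
```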
arXiv Detail & Related papers (2021-05-28T22:20:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.