Learning from Delayed Feedback in Games via Extra Prediction
- URL: http://arxiv.org/abs/2509.22426v2
- Date: Fri, 07 Nov 2025 08:42:09 GMT
- Title: Learning from Delayed Feedback in Games via Extra Prediction
- Authors: Yuma Fujimoto, Kenshi Abe, Kaito Ariu,
- Abstract summary: This study raises and addresses the problem of time-delayed feedback in learning in games.<n>To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL)
- Score: 26.93300099029726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study firstly proves that even a single-step delay worsens the performance of OFTRL from the aspects of social regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the good performances that social regret is constant in general-sum normal-form games, and the strategies last-iterate converge to the Nash equilibrium in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
Related papers
- Learn Hard Problems During RL with Reference Guided Fine-tuning [56.56461712665904]
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity.<n>We introduce Reference-Guided Fine-Tuning (ReGFT) to synthesize positive trajectories on hard problems and train on them before RL.<n>Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
arXiv Detail & Related papers (2026-03-01T18:41:28Z) - Linear Convergence in Games with Delayed Feedback via Extra Prediction [26.93300099029726]
This paper derives the rate of linear convergence of Weighted Optimistic Gradient Descent-Ascent (WOGDA)<n>We analyze it as an approximation of the Extra Proximal Point (EPP), which is updated based on farther future rewards than the classical Proximal Point (PP)
arXiv Detail & Related papers (2026-02-19T15:56:27Z) - On the Learning Dynamics of RLVR at the Edge of Competence [86.52481827737097]
Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models.<n>We develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks.
arXiv Detail & Related papers (2026-02-16T16:03:08Z) - Seeing the Arrow of Time in Large Multimodal Models [55.13176722268499]
Current large multimodal models (LMMs) struggle to perceive and utilize temporal directionality in video when responding to language queries.<n>We introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness.<n>For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions.
arXiv Detail & Related papers (2025-06-03T19:32:07Z) - Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning [55.36978389831446]
We recast reflective exploration within the Bayes-Adaptive RL framework.<n>Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on observed outcomes.
arXiv Detail & Related papers (2025-05-26T22:51:00Z) - Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [60.67176246634741]
We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem.<n>We show that state-of-the-art models do not minimize regret, but one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL.
arXiv Detail & Related papers (2025-03-10T17:40:43Z) - On the Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving extensive-form games under sampling.<n>We present a unified framework for textitPerturbed FTRL algorithms and study two variants: PFTRL-KL and PFTRL-RKL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z) - No-regret learning in harmonic games: Extrapolation in the face of conflicting interests [45.94247914236653]
We show that learning converges to a Nash equilibrium from any initial condition, and all players are guaranteed at most O(1) regret.<n>Results provide an in-depth understanding of no-regret learning in harmonic games.
arXiv Detail & Related papers (2024-12-28T16:28:13Z) - Bootstrapping Expectiles in Reinforcement Learning [25.793702194455772]
Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states.
In practice, this can be very simply done by replacing the $L$ loss with a more general expectile loss for the critic.
For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin-critic.
arXiv Detail & Related papers (2024-06-06T13:51:39Z) - Achieving Better Regret against Strategic Adversaries [15.51709428653595]
We study online learning problems in which the learner has extra knowledge about the adversary's behaviour.
We propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR)
AFTRL achieves $O(1)$ external regret or $O(1)$ emphforward regret against no-external regret adversary.
arXiv Detail & Related papers (2023-02-13T19:34:36Z) - Asynchronous Gradient Play in Zero-Sum Multi-agent Games [25.690033495071923]
We study asynchronous gradient plays in zero-sum polymatrix games under delayed feedbacks.
To the best of our knowledge, this work is the first that aims to understand asynchronous gradient play in zero-sum polymatrix games.
arXiv Detail & Related papers (2022-11-16T15:37:23Z) - Faster Game Solving via Predictive Blackwell Approachability: Connecting
Regret Matching and Mirror Descent [119.5481797273995]
Follow-the-regularized-leader (FTRL) and online mirror descent (OMD) are the most prevalent regret minimizers in online convex optimization.
We show that RM and RM+ are the algorithms that result from running FTRL and OMD, respectively, to select the halfspace to force at all times in the underlying Blackwell approachability game.
In experiments across 18 common zero-sum extensive-form benchmark games, we show that predictive RM+ coupled with counterfactual regret minimization converges vastly faster than the fastest prior algorithms.
arXiv Detail & Related papers (2020-07-28T16:49:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.