Towards Formalizing Reinforcement Learning Theory
- URL: http://arxiv.org/abs/2511.03618v1
- Date: Wed, 05 Nov 2025 16:35:47 GMT
- Title: Towards Formalizing Reinforcement Learning Theory
- Authors: Shangtong Zhang,
- Abstract summary: We formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples.<n>This work makes an important step towards fully formalizing convergent RL results.
- Score: 21.93657660333281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.
Related papers
- Statistical Learning Theory in Lean 4: Empirical Processes from Scratch [57.00315741159824]
We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory.<n>Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library.<n>This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory.
arXiv Detail & Related papers (2026-02-02T16:24:53Z) - Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose textitbCoupled bVari bReinforcement bLearning (CoVRL) to bridge variational inference and reinforcement learning.<n>CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z) - Extensions of Robbins-Siegmund Theorem with Applications in Reinforcement Learning [19.81737958703724]
We extend the Robbins-Siegmund theorem for almost supermartingales where the zero-order term is not summable but only square summable.<n>We obtain the first almost sure convergence rate, the first high probability concentration bound, and the first $Lp$ convergence rate for $Qp$-learning with linear function.
arXiv Detail & Related papers (2025-09-30T16:00:36Z) - Parallel-R1: Towards Parallel Thinking via Reinforcement Learning [65.68667585027232]
Parallel thinking is a novel approach for enhancing the reasoning capabilities of large language models.<n>We propose textbfParallel-R1, the first reinforcement learning framework that enables parallel thinking behaviors.<n>Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking.
arXiv Detail & Related papers (2025-09-09T17:59:35Z) - Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers [16.135928990655422]
This paper introduces textttBFS-Prover-V2, a system designed to address the dual scaling problem.<n>The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time.<n>The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time.
arXiv Detail & Related papers (2025-09-08T09:54:18Z) - Convergence Analysis of Aggregation-Broadcast in LoRA-enabled Distributed Fine-Tuning [4.255739817172272]
Federated Learning (FL) enables collaborative model training across decentralized data sources.<n>Low-Rank Adaptation (LoRA) has been introduced into FL as an efficient fine-tuning method.<n>How to aggregate LoRA-updated local models on the server remains a critical and understudied problem.
arXiv Detail & Related papers (2025-08-02T12:54:17Z) - Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks.<n>Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data.<n>In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation [58.288682735160585]
Low-Rank Adaptation (LoRA) is a popular technique for finetuning models.
LoRA often under performs when compared to full- parameter fine-tuning.
We present a framework that rigorously analyzes the adaptation rates of LoRA methods.
arXiv Detail & Related papers (2024-10-10T18:51:53Z) - Branching Reinforcement Learning [16.437993672422955]
We propose a novel Branching Reinforcement Learning (Branching RL) model.
We investigate Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model.
This model finds important applications in hierarchical recommendation systems and online advertising.
arXiv Detail & Related papers (2022-02-16T11:19:03Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks with quadratic widths in the sample size and linear in their depth at a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.