Statistical Learning Theory in Lean 4: Empirical Processes from Scratch
- URL: http://arxiv.org/abs/2602.02285v1
- Date: Mon, 02 Feb 2026 16:24:53 GMT
- Title: Statistical Learning Theory in Lean 4: Empirical Processes from Scratch
- Authors: Yuanhe Zhang, Jason D. Lee, Fanghui Liu,
- Abstract summary: We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory.<n>Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library.<n>This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory.
- Score: 57.00315741159824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at https://github.com/YuanheZ/lean-stat-learning-theory
Related papers
- Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL)<n>We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z) - Towards Formalizing Reinforcement Learning Theory [21.93657660333281]
We formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples.<n>This work makes an important step towards fully formalizing convergent RL results.
arXiv Detail & Related papers (2025-11-05T16:35:47Z) - LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research [9.819685510441902]
Grounded theory offers deep insights from qualitative data, but reliance on expert-intensive manual coding presents a major scalability bottleneck.<n>We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow.<n> LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks.
arXiv Detail & Related papers (2025-09-29T05:16:09Z) - FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory [44.64175433092553]
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving.<n>Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored.<n>We introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning.
arXiv Detail & Related papers (2025-09-26T14:40:14Z) - Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind [0.0]
We report a structural convergence among four influential theories of mind.<n>The convergence emerges unintentionally within a practical AI architecture known as Agentic Flow.<n>This paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical constraints.
arXiv Detail & Related papers (2025-07-22T02:54:45Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z) - Variance Reduced ProxSkip: Algorithm, Theory and Application to
Federated Learning [0.0]
We study distributed optimization methods based on the em local training (LT) paradigm.
We show that our methods can be substantially faster in terms of the em total training cost than the state-of-the-art method ProxSkip.
arXiv Detail & Related papers (2022-07-09T20:59:30Z) - Linear Temporal Logic Modulo Theories over Finite Traces (Extended
Version) [72.38188258853155]
This paper studies Linear Temporal Logic over Finite Traces (LTLf)
proposition letters are replaced with first-order formulas interpreted over arbitrary theories.
The resulting logic, called Satisfiability Modulo Theories (LTLfMT), is semi-decidable.
arXiv Detail & Related papers (2022-04-28T17:57:33Z) - Proof Artifact Co-training for Theorem Proving with Language Models [4.934817254755007]
PACT (bf Proof bf Artifact bf Co-bf Training) is a general methodology for extracting self-supervised data from kernel-level proof terms for co-training.
We instrument Lean with a neural theorem prover driven by a Transformer language model and show that PACT improves theorem proving success rate on a held-out suite of test theorems from 32% to 48%.
arXiv Detail & Related papers (2021-02-11T18:59:24Z) - On Computation and Generalization of Generative Adversarial Imitation
Learning [134.17122587138897]
Generative Adversarial Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies.
This paper investigates the theoretical properties of GAIL.
arXiv Detail & Related papers (2020-01-09T00:40:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.