Mathematics and Coding are Universal AI Benchmarks
- URL: http://arxiv.org/abs/2512.13764v1
- Date: Mon, 15 Dec 2025 14:36:29 GMT
- Title: Mathematics and Coding are Universal AI Benchmarks
- Authors: Przemyslaw Chojecki
- Abstract summary: We study the special role of mathematics and coding inside the moduli space of psychometric batteries for AI agents. We show that when paired with formal proof kernels (e.g. Lean, Coq), GVU flows on this fiber admit spectrally stable self-improvement regimes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the special role of mathematics and coding inside the moduli space of psychometric batteries for AI agents. Building on the AAI framework and GVU dynamics from previous works, we define the Mathematics Fiber and show that, when paired with formal proof kernels (e.g. Lean, Coq), GVU flows on this fiber admit spectrally stable self-improvement regimes due to oracle-like verification. Our main technical result is a density theorem: under uniform tightness of agent outputs and a Lipschitz AAI functional, the subspace of batteries generated by mathematical theorem-proving and coding tasks is dense in the moduli space of batteries with respect to the evaluation metric. Coding alone is universal in this sense, while pure mathematics is not; its privilege is spectral rather than expressive. We interpret this as evidence that mathematics and coding provide ``universal coordinates'' for evaluation, and that formal mathematics is a natural ignition domain for recursive self-improvement in advanced AI agents.
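The density theorem in the abstract can be written schematically as follows; the notation is an illustrative assumption on my part (the paper's own symbols are not given here): $\mathcal{M}$ for the moduli space of batteries, $d_{\mathrm{eval}}$ for the evaluation metric, and $\mathcal{B}_{\mathrm{mc}}$ for the subspace of batteries generated by theorem-proving and coding tasks.

```latex
% Schematic statement under assumed notation (not the paper's):
% \mathcal{M}: moduli space of batteries
% d_{\mathrm{eval}}: evaluation metric
% \mathcal{B}_{\mathrm{mc}}: batteries generated by theorem-proving and coding tasks
%
% Under uniform tightness of agent outputs and a Lipschitz AAI functional:
\forall B \in \mathcal{M},\ \forall \varepsilon > 0,\
\exists\, B' \in \mathcal{B}_{\mathrm{mc}}
\ \text{such that}\ d_{\mathrm{eval}}(B, B') < \varepsilon .
```

That is, every battery can be approximated arbitrarily well, in the evaluation metric, by one built from mathematical theorem-proving and coding tasks.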
Related papers
- Towards Autonomous Mathematics Research [48.29504087871558]
We introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems. We demonstrate Aletheia on problems ranging from Olympiad level to PhD-level exercises and, most notably, on several distinct milestones in AI-assisted mathematics research.
arXiv Detail & Related papers (2026-02-10T18:50:15Z)
- The Geometry of Benchmarks: A New Path Toward AGI [0.0]
We introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance. Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate, and verifier-based fine-tuning.
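The Generator-Verifier-Updater pattern described above can be sketched as a minimal loop. This is a hypothetical illustration, not the paper's formalism: `generate`, `verify`, and `update` stand in for an agent's sampler, an oracle-like checker (e.g. a proof kernel), and a policy update, and the toy instantiation below is invented for demonstration.

```python
import random

def gvu_step(policy, generate, verify, update, n_samples=8):
    """One Generator-Verifier-Updater (GVU) step.

    `generate` samples candidate outputs from the current policy,
    `verify` scores each candidate (e.g. a proof checker returning 0/1),
    and `update` folds the scored samples back into the policy.
    All names and signatures are illustrative assumptions.
    """
    candidates = [generate(policy) for _ in range(n_samples)]
    scored = [(c, verify(c)) for c in candidates]
    return update(policy, scored)

# Toy instantiation: the "policy" is a support set of integers and the
# "verifier" is an oracle that accepts only even outputs.
def generate(policy):
    return random.choice(policy["support"])

def verify(candidate):
    return 1.0 if candidate % 2 == 0 else 0.0

def update(policy, scored):
    accepted = [c for c, s in scored if s > 0]
    if accepted:  # keep only verified outputs; otherwise retain old policy
        policy = {"support": accepted}
    return policy

policy = {"support": [1, 2, 3, 4]}
for _ in range(5):
    policy = gvu_step(policy, generate, verify, update)
```

With an oracle-like verifier, the flow contracts onto verified outputs, which is the intuition behind the "spectrally stable self-improvement regimes" claimed for the mathematics fiber.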
arXiv Detail & Related papers (2025-12-03T21:34:09Z)
- ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms [4.235429894371577]
ATHENA is an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual problem. The framework achieves super-human performance, reaching validation errors of $10^{-14}$.
arXiv Detail & Related papers (2025-12-03T06:05:27Z)
- Agentic Program Verification [14.684859166069012]
We present a first Large Language Model agent, AutoRocq, for conducting program verification. Unlike past works, which rely on extensive training of LLMs on proof examples, our agent learns on the fly and improves the proof via an iterative refinement loop. In this way, our proof construction involves autonomous collaboration between the proof agent and the theorem prover.
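The iterative refinement loop mentioned above can be sketched generically. This is a hedged sketch of the propose-check-refine pattern, not AutoRocq's actual interface: `propose`, `check`, and `refine` stand in for an LLM call and a theorem-prover kernel, and the toy instantiation is invented for demonstration.

```python
def prove_with_refinement(goal, propose, check, refine, max_iters=5):
    """Generic propose-check-refine loop.

    `propose` drafts an initial proof attempt, `check` consults the
    prover and returns (success, feedback), and `refine` revises the
    attempt using that feedback. All names are illustrative assumptions.
    """
    proof = propose(goal)
    for _ in range(max_iters):
        ok, feedback = check(goal, proof)
        if ok:
            return proof
        proof = refine(goal, proof, feedback)  # fold prover errors back in
    return None  # refinement budget exhausted

# Toy instantiation: the "goal" is a target integer, a "proof" is a
# guess, and the "prover" reports whether the guess is low or high.
def propose(goal):
    return 0

def check(goal, proof):
    if proof == goal:
        return True, ""
    return False, "low" if proof < goal else "high"

def refine(goal, proof, feedback):
    return proof + 1 if feedback == "low" else proof - 1
```

The key property this sketch captures is that the verifier's feedback, not additional training, drives each revision.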
arXiv Detail & Related papers (2025-11-21T15:51:48Z)
- AutoMathKG: The automated mathematical knowledge graph based on LLM and vector database [1.799933345199395]
A mathematical knowledge graph (KG) presents knowledge within the field of mathematics in a structured manner. This paper proposes AutoMathKG, a high-quality, wide-coverage, and multi-dimensional math KG capable of automatic updates.
arXiv Detail & Related papers (2025-05-19T17:41:29Z)
- Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
- Formal Mathematical Reasoning: A New Frontier in AI [60.26950681543385]
We advocate for formal mathematical reasoning and argue that it is indispensable for advancing AI4Math to the next level. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
arXiv Detail & Related papers (2024-12-20T17:19:24Z)
- Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models [4.036530158875673]
This paper introduces a mathematical framework for defining and quantifying self-identity in AI systems. Our framework posits that self-identity emerges from two mathematically quantifiable conditions. The implications of our study are immediately relevant to the fields of humanoid robotics and autonomous systems.
arXiv Detail & Related papers (2024-11-27T17:23:47Z)
- LeanAgent: Lifelong Learning for Formal Theorem Proving [85.39415834798385]
We present LeanAgent, a novel lifelong learning framework for formal theorem proving. LeanAgent continuously generalizes to and improves on ever-expanding mathematical knowledge. It generates formal proofs for 155 theorems across 23 diverse Lean repositories.
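For readers unfamiliar with the target format, a minimal Lean statement of the kind such theorem-proving agents produce looks like the following (an illustrative example, not one of LeanAgent's 155 theorems):

```lean
-- Minimal Lean 4 example: commutativity of natural-number addition,
-- discharged by the standard library lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The Lean kernel checks such proofs mechanically, which is exactly the oracle-like verification the main paper relies on.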
arXiv Detail & Related papers (2024-10-08T17:11:24Z)
- Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry [63.694184882697435]
Global Covariance Pooling (GCP) has been demonstrated to improve the performance of Deep Neural Networks (DNNs) by exploiting second-order statistics of high-level representations. This paper provides a comprehensive and unified understanding of the matrix logarithm and power from a Riemannian geometry perspective.
arXiv Detail & Related papers (2024-07-15T07:11:44Z)
- Generative Language Modeling for Automated Theorem Proving [94.01137612934842]
This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance.
arXiv Detail & Related papers (2020-09-07T19:50:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.