Solving a Research Problem in Mathematical Statistics with AI Assistance
- URL: http://arxiv.org/abs/2511.18828v1
- Date: Mon, 24 Nov 2025 07:03:56 GMT
- Title: Solving a Research Problem in Mathematical Statistics with AI Assistance
- Authors: Edgar Dobriban
- Abstract summary: We show how GPT-5 helped us solve a previously unsolved research problem in robust mathematical statistics. Our problem concerns robust density estimation, where the observations are perturbed by Wasserstein-bounded contaminations. GPT-5 provided crucial help along the way, including by suggesting calculations that we did not think of, and techniques that were not familiar to us.
- Score: 19.35055637720468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last few months, AI models including large language models have improved greatly. There are now several documented examples where they have helped professional mathematical scientists prove new results, sometimes even helping resolve known open problems. In this short note, we add another example to the list by documenting how we solved a previously unsolved research problem in robust mathematical statistics with crucial help from GPT-5. Our problem concerns robust density estimation, where the observations are perturbed by Wasserstein-bounded contaminations. In a previous preprint (Chao and Dobriban, 2023, arXiv:2308.01853v2), we had obtained upper and lower bounds on the minimax optimal estimation error, which were, however, not sharp. Starting in October 2025, making significant use of GPT-5 Pro, we were able to derive the minimax optimal error rate (reported in version 3 of the above arXiv preprint). GPT-5 provided crucial help along the way, including by suggesting calculations that we did not think of and techniques that were not familiar to us, such as the dynamic Benamou-Brenier formulation, for key steps in the analysis. Working with GPT-5 took a few weeks of effort, and we estimate that it could have taken several months to obtain the same results otherwise. At the same time, there are still areas where working with GPT-5 was challenging: it sometimes provided incorrect references and glossed over details that took days of work to fill in. We outline our workflow and the steps taken to mitigate these issues. Overall, our work can serve as additional documentation for a new age of human-AI collaborative work in mathematical science.
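To make the setting concrete, the estimation problem described in the abstract can be sketched as a minimax risk. This is a schematic formulation inferred from the abstract only: the density class F, the loss norm, and the use of the order-1 Wasserstein distance W_1 (rather than another order) are illustrative assumptions, not details taken from the preprint.

```latex
% Schematic minimax risk for density estimation under
% Wasserstein-bounded contamination (illustrative notation):
% P_f is the distribution with density f, and the adversary may
% replace it with any Q within Wasserstein radius \varepsilon.
\mathcal{R}_n(\varepsilon)
  \;=\;
  \inf_{\hat{f}_n}\;
  \sup_{f \in \mathcal{F}}\;
  \sup_{Q \,:\, W_1(Q,\, P_f) \le \varepsilon}\;
  \mathbb{E}_{X_1,\dots,X_n \sim Q}
  \bigl\| \hat{f}_n(X_1,\dots,X_n) - f \bigr\|
```

The preprint cited above establishes upper and lower bounds on a quantity of this form, with the sharp rate derived in version 3.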
Related papers
- GPT-5 vs Other LLMs in Long Short-Context Performance [2.640490999540592]
This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. As the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly. For GPT-5, despite a sharp decline in accuracy, precision remained high at approximately 95%.
arXiv Detail & Related papers (2026-02-15T15:26:25Z) - Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms [14.853721511192736]
Large language models (LLMs) have led to breakthroughs in automated mathematical reasoning and scientific discovery. We present a benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4. We find that while the top-tier models achieve a high accuracy rate, other models lag significantly in consistency.
arXiv Detail & Related papers (2025-12-16T00:34:55Z) - Early science acceleration experiments with GPT-5 [58.27301147653905]
We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research. In these examples, the authors highlight how AI accelerated their work, and where it fell short. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI.
arXiv Detail & Related papers (2025-11-20T06:04:23Z) - Gödel Test: Can Large Language Models Solve Easy Conjectures? [40.906606632144694]
We propose the Gödel Test: evaluating whether a model can produce correct proofs for very simple, previously unsolved conjectures. We study the performance of GPT-5 on five conjectures in algorithm optimization. GPT-5 may represent an early step toward frontier models eventually passing the Gödel Test.
arXiv Detail & Related papers (2025-09-22T20:11:40Z) - Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline [10.177917426690703]
Large language models often struggle with Olympiad-level problems. We construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025.
arXiv Detail & Related papers (2025-07-21T17:59:49Z) - MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Formal Mathematical Reasoning: A New Frontier in AI [60.26950681543385]
We advocate for formal mathematical reasoning and argue that it is indispensable for advancing AI4Math to the next level. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
arXiv Detail & Related papers (2024-12-20T17:19:24Z) - HARP: A challenging human-annotated math reasoning benchmark [7.691786865279827]
We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically check-able (with libraries such as SymPy). These problems span six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written
arXiv Detail & Related papers (2024-12-11T23:31:06Z) - How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model [52.92472140375308]
We use mechanistic interpretability techniques to explain the mathematical abilities of GPT-2 small.
We show that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year.
Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism.
arXiv Detail & Related papers (2023-04-30T21:44:21Z) - HaT5: Hate Language Identification using Text-to-Text Transfer Transformer [1.2532400738980594]
We investigate the performance of a state-of-the art (SoTA) architecture T5 across 5 different tasks from 2 relatively diverse datasets.
To improve performance, we augment the training data by using an autoregressive model.
Using a small set of examples, we reveal the difficulties posed by poor data annotation.
arXiv Detail & Related papers (2022-02-11T15:21:27Z) - GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge.
We propose a Geometric Question Answering dataset GeoQA, containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.