Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
- URL: http://arxiv.org/abs/2603.03202v2
- Date: Wed, 04 Mar 2026 04:22:14 GMT
- Title: Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
- Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung,
- Abstract summary: We investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems.
- Score: 40.0763986629474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.
Related papers
- Even with AI, Bijection Discovery is Still Hard: The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction [7.629457153784809]
Evolutionary program synthesis systems such as AlphaEvolve, OpenEvolve, and ShinkaEvolve offer a new approach to AI-assisted mathematical discovery. These systems utilize teams of large language models (LLMs) to generate candidate solutions to a problem as human readable code. We describe the results of applying OpenEvolve to three construction problems involving Dyck paths, two of which are known and one of which is open.
arXiv Detail & Related papers (2025-11-26T02:30:17Z) - AI Agents as Universal Task Solvers [94.49762121230042]
We show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. We argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
arXiv Detail & Related papers (2025-10-14T02:17:54Z) - CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z) - SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z) - URSA: The Universal Research and Scientific Agent [0.39487937309998083]
We present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.
arXiv Detail & Related papers (2025-06-27T21:56:02Z) - From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z) - Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation [9.920563105290894]
Cogito is a neurobiologically inspired multi-agent framework to enhance the problem-solving capabilities in code generation tasks with lower cost. Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent, to perform the code generation task.
arXiv Detail & Related papers (2025-01-30T01:41:44Z) - From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models [38.71041354422434]
Large Language Models (LLMs) solely trained on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. We show the first analysis of how mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training.
arXiv Detail & Related papers (2024-07-01T01:56:28Z) - MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge [0.6708125191843434]
A set of AI agents can solve mechanics tasks, here demonstrated for elasticity problems, via autonomous collaborations.
A two-agent team can effectively write, execute and self-correct code, in order to apply finite element methods to solve classical elasticity problems.
For more complex tasks, we construct a larger group of agents with enhanced division of labor among planning, formulating, coding, executing and criticizing the process and results.
arXiv Detail & Related papers (2023-11-14T13:49:03Z) - Measuring Mathematical Problem Solving With the MATH Dataset [55.4376028963537]
We introduce MATH, a dataset of 12,500 challenging competition mathematics problems.
Each problem has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
We also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
arXiv Detail & Related papers (2021-03-05T18:59:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.