Related papers: Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

URL: http://arxiv.org/abs/2603.03202v2
Date: Wed, 04 Mar 2026 04:22:14 GMT
Title: Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
Abstract summary: We investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems.
Score: 40.0763986629474
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.

Related papers

Even with AI, Bijection Discovery is Still Hard: The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction [7.629457153784809]
Evolutionary program synthesis systems such as AlphaEvolve, OpenEvolve, and ShinkaEvolve offer a new approach to AI-assisted mathematical discovery. These systems utilize teams of large language models (LLMs) to generate candidate solutions to a problem as human-readable code. We describe the results of applying OpenEvolve to three construction problems involving Dyck paths, two of which are known and one of which is open.
arXiv Detail & Related papers (2025-11-26T02:30:17Z)
AI Agents as Universal Task Solvers [94.49762121230042]
We show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. We argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
arXiv Detail & Related papers (2025-10-14T02:17:54Z)
CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves an increase of up to 21% over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z)
SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z)
URSA: The Universal Research and Scientific Agent [0.39487937309998083]
We present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.
arXiv Detail & Related papers (2025-06-27T21:56:02Z)
From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments. We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases. Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z)
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation [9.920563105290894]
Cogito is a neurobiologically inspired multi-agent framework that enhances problem-solving capabilities in code generation tasks at lower cost. Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent, to perform the code generation task.
arXiv Detail & Related papers (2025-01-30T01:41:44Z)
From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models [38.71041354422434]
Large Language Models (LLMs) solely trained on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. We show the first analysis of how mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training.
arXiv Detail & Related papers (2024-07-01T01:56:28Z)
MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge [0.6708125191843434]
A set of AI agents can solve mechanics tasks, here demonstrated for elasticity problems, via autonomous collaborations. A two-agent team can effectively write, execute and self-correct code, in order to apply finite element methods to solve classical elasticity problems. For more complex tasks, we construct a larger group of agents with enhanced division of labor among planning, formulating, coding, executing and criticizing the process and results.
arXiv Detail & Related papers (2023-11-14T13:49:03Z)
Measuring Mathematical Problem Solving With the MATH Dataset [55.4376028963537]
We introduce MATH, a dataset of 12,500 challenging competition mathematics problems. Each problem has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. We also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
arXiv Detail & Related papers (2021-03-05T18:59:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.