THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
- URL: http://arxiv.org/abs/2509.13761v2
- Date: Fri, 03 Oct 2025 12:48:44 GMT
- Title: THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
- Authors: Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao,
- Abstract summary: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning.<n>Despite recent advances, existing methods struggle with three key challenges.<n>We propose THOR (Tool-Integrated Hierarchical Optimization via RL) to overcome these limitations.<n>Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models.
- Score: 25.605096023894834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
Related papers
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent [80.83250816918861]
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought.<n>However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations.<n>We present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision.
arXiv Detail & Related papers (2025-12-23T19:57:49Z) - MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation [86.82285754460491]
We propose a new benchmark designed to evaluate both text and image output modalities.<n>This performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image.<n>We propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images.
arXiv Detail & Related papers (2025-11-12T18:58:21Z) - ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset [43.45582911794623]
We introduce ToolMind, a high-quality tool-agentic dataset with 160k synthetic data instances.<n>We employ fine-grained turn-level filtering to remove erroneous or suboptimal steps.<n>Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
arXiv Detail & Related papers (2025-11-12T13:01:23Z) - Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning [29.280386584974455]
Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers.<n>These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning.<n>We propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that integrates multi-hop reasoning with adaptive tool-calling capabilities.
arXiv Detail & Related papers (2025-10-08T14:04:27Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks.<n>Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use.<n>We propose a framework that encourages models to produce accurate answers with minimal tool calls.<n>Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use [56.31110409360567]
Augmenting large language models with external tools is a promising approach to enhance their capabilities.<n>We show that training gains significantly decay as synthetic data increases.<n>We propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation.
arXiv Detail & Related papers (2025-01-15T04:52:34Z) - Optimizing Chain-of-Thought Reasoning: Tackling Arranging Bottleneck via Plan Augmentation [34.042565099565934]
We propose a plan-based training and reasoning method that guides models to generate arranging steps through abstract plans.
Results show that compared to fine-tuning directly with CoT data, our approach achieves a better performance on alleviating arranging bottleneck.
arXiv Detail & Related papers (2024-10-22T08:38:50Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.