Modeling Complex Mathematical Reasoning via Large Language Model based
MathAgent
- URL: http://arxiv.org/abs/2312.08926v2
- Date: Sun, 17 Dec 2023 03:34:36 GMT
- Title: Modeling Complex Mathematical Reasoning via Large Language Model based
MathAgent
- Authors: Haoran Liao, Qinyi Du, Shaohua Hu, Hao He, Yanyan Xu, Jidong Tian,
Yaohui Jin
- Abstract summary: Large language models (LLMs) face challenges in solving complex mathematical problems.
We propose a formal description of the mathematical problem-solving process and extend LLMs with an agent-based zero-shot framework.
Experiments on miniF2F and MATH have demonstrated the effectiveness of PRER and the proposed MathAgents.
- Score: 15.81048994298046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) face challenges in solving complex mathematical
problems that require comprehensive capacities to parse the statements,
associate domain knowledge, perform compound logical reasoning, and integrate
the intermediate rationales. Tackling all of these problems at once can be arduous
for LLMs and leads to confusion during generation. In this work, we explore the
potential of enhancing LLMs with agents through meticulous decomposition and
modeling of the mathematical reasoning process. Specifically, we propose a formal
description of the mathematical problem-solving process and extend LLMs with an agent-based
zero-shot framework named
$\bf{P}$lanner-$\bf{R}$easoner-$\bf{E}$xecutor-$\bf{R}$eflector (PRER). We
further provide and implement two MathAgents that define the logical forms and
inherent relations via a pool of actions in different grains and orientations:
MathAgent-M adapts its actions to LLMs, while MathAgent-H aligns with
humankind. Experiments on miniF2F and MATH have demonstrated the effectiveness
of PRER and the proposed MathAgents, achieving an increase of
$12.3\%$ ($53.9\% \xrightarrow{} 66.2\%$) on miniF2F, $9.2\%$
($49.8\% \xrightarrow{} 59.0\%$) on MATH, and
$13.2\%$ ($23.2\% \xrightarrow{} 35.4\%$) for level-5 problems of MATH against
GPT-4. Further analytical results provide more insightful perspectives on
exploiting the behaviors of LLMs as agents.
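The PRER framework described above can be pictured as four LLM roles invoked in sequence. The snippet below is a minimal sketch of such a Planner-Reasoner-Executor-Reflector loop, assuming a generic llm(prompt) completion function; the prompt wording, stopping rule, and function names are illustrative assumptions and do not reproduce the authors' MathAgent-M or MathAgent-H action pools.

```python
# Minimal sketch of a Planner-Reasoner-Executor-Reflector (PRER) style loop.
# Assumes a generic `llm(prompt: str) -> str` completion function; prompts,
# action granularity, and the stopping rule are illustrative only.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completion API here")

def solve(problem: str, max_rounds: int = 5) -> str:
    rationale = ""  # accumulated intermediate results
    for _ in range(max_rounds):
        # Planner: decompose the remaining work into the next sub-goal.
        plan = llm(f"Problem: {problem}\nKnown so far: {rationale}\n"
                   "State the next sub-goal needed to finish the solution.")
        # Reasoner: produce a step-by-step derivation for that sub-goal.
        step = llm(f"Problem: {problem}\nSub-goal: {plan}\n"
                   "Derive this sub-goal step by step.")
        # Executor: carry out the derivation into a concrete result.
        result = llm("Carry out the derivation and give only the concrete result:\n"
                     f"{step}")
        rationale += f"\n- {plan}: {result}"
        # Reflector: check the accumulated work and decide whether to stop.
        verdict = llm(f"Problem: {problem}\nWork so far: {rationale}\n"
                      "If the problem is fully solved, reply 'DONE: <final answer>'. "
                      "Otherwise reply 'CONTINUE'.")
        if verdict.startswith("DONE:"):
            return verdict[len("DONE:"):].strip()
    return rationale  # fall back to the partial rationale if no final answer was reached
```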
Related papers
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based search method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems [10.517708404982624]
This paper introduces the Multi-Agent System for Condition Mining (MACM) prompting method.
It resolves intricate mathematical problems and demonstrates strong generalization capabilities across various mathematical contexts.
With the assistance of MACM, the accuracy of GPT-4 Turbo on the most challenging level-five mathematical problems in the MATH dataset increases from $54.68\%$ to $76.73\%$.
arXiv Detail & Related papers (2024-04-06T21:39:01Z) - Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks.
This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
arXiv Detail & Related papers (2024-03-30T12:48:31Z) - Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z) - Can Large Language Models Play Games? A Case Study of A Self-Play
Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs InternLM-Math, which are continually pre-trained from InternLM2.
We combine chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and a code interpreter in a unified seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z) - Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z) - MathPrompter: Mathematical Reasoning using Large Language Models [7.953723258038284]
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks.
MathPrompter uses the zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways.
arXiv Detail & Related papers (2023-03-04T04:43:49Z) - Can Mean Field Control (MFC) Approximate Cooperative Multi Agent
Reinforcement Learning (MARL) with Non-Uniform Interaction? [33.484960394599455]
Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems.
In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix.
We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then such a non-uniform MARL problem can be approximated by its mean-field counterpart.
arXiv Detail & Related papers (2022-02-28T19:03:09Z) - Thompson sampling for linear quadratic mean-field teams [3.957353452014781]
We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents.
We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of different types at time horizon $T$ is $\tilde{\mathcal{O}}\big(|M|^{1.5}\sqrt{T}\big)$, irrespective of the total number of agents.
arXiv Detail & Related papers (2020-11-09T19:07:32Z)