Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?
- URL: http://arxiv.org/abs/2407.05134v1
- Date: Sat, 6 Jul 2024 17:01:04 GMT
- Title: Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?
- Authors: Kuei-Chun Kao, Ruochen Wang, Cho-Jui Hsieh
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems.
This paper introduces a novel benchmark, BeyondX, designed to address the limitations of existing benchmarks by incorporating problems with multiple unknowns.
Empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, significantly decreases as the number of unknowns increases.
- Score: 57.80779199039929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems, a hallmark of human intelligence. However, despite high success rates, current benchmarks often feature simple problems with only one or two unknowns, which do not sufficiently challenge the models' reasoning capacities. This paper introduces a novel benchmark, BeyondX, designed to address these limitations by incorporating problems with multiple unknowns. Recognizing the challenge of proposing multi-unknown problems from scratch, we developed BeyondX using an automated pipeline that progressively increases complexity by expanding the number of unknowns in simpler problems. An empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, decreases significantly as the number of unknowns increases, with a performance drop of up to 70% observed in GPT-4. To tackle these challenges, we propose the Formulate-and-Solve strategy, a generalized prompting approach that effectively handles problems with an arbitrary number of unknowns. Our findings show that this strategy not only enhances LLM performance on the BeyondX benchmark but also provides deeper insight into the computational limits of LLMs when faced with more complex mathematical challenges.
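To make the multi-unknown setting concrete, below is a minimal sketch of the kind of three-unknown word problem BeyondX targets, split into a formulate step (the symbolic system a model must extract from the narrative) and a solve step (handled here by sympy). The word problem, variable names, and use of sympy are illustrative assumptions; the abstract does not reproduce the Formulate-and-Solve prompt itself.

```python
# Hypothetical BeyondX-style problem with three unknowns:
# "Alice, Bob, and Carol have 33 apples in total. Alice has twice as many
#  apples as Bob, and Carol has 3 more apples than Alice. How many apples
#  does each person have?"
from sympy import Eq, solve, symbols

# Formulate: translate the narrative into a symbolic equation system.
a, b, c = symbols("a b c")  # apples held by Alice, Bob, Carol
system = [
    Eq(a + b + c, 33),  # "33 apples in total"
    Eq(a, 2 * b),       # "Alice has twice as many as Bob"
    Eq(c, a + 3),       # "Carol has 3 more than Alice"
]

# Solve: delegate the algebra to an exact symbolic solver.
print(solve(system, [a, b, c], dict=True))  # [{a: 12, b: 6, c: 15}]
```

Each additional unknown adds another equation that must be extracted from the narrative and kept consistent with the rest of the system, which is where the abstract reports accuracy falling by as much as 70% for GPT-4.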
Related papers
- Gap-Filling Prompting Enhances Code-Assisted Mathematical Reasoning [0.0]
Chain-of-thought (CoT) and program-of-thought (PoT) fine-tuning are common methods for transferring LLM knowledge to small language models (SLMs).
This paper introduces Gap-Filling Prompting (GFP), a novel two-step prompting strategy designed to enhance the problem-solving process for SLMs.
arXiv Detail & Related papers (2024-11-08T08:52:59Z)
- BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs).
The decision to employ more sophisticated cognitive skills is made by the LLM itself through self-evaluation.
Extensive experiments across four popular math reasoning datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
- MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data [20.31528845718877]
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities.
This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset.
arXiv Detail & Related papers (2024-06-26T13:02:35Z)
- Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks.
LLMs show emergent abilities in mathematical reasoning benchmarks.
We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z)
- PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed weakness is that when math questions are slightly altered, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering aspects such as problems' release times, difficulty levels, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and problem types.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models [49.23348672822087]
We propose Knowledge Crosswords, a benchmark consisting of incomplete knowledge networks bounded by structured factual constraints.
The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA.
We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords.
arXiv Detail & Related papers (2023-10-02T15:43:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.