ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
- URL: http://arxiv.org/abs/2410.19056v1
- Date: Thu, 24 Oct 2024 18:02:37 GMT
- Title: ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
- Authors: Xiaodong Yu, Ben Zhou, Hao Cheng, Dan Roth
- Abstract summary: Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means of automated evaluation, testing whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
- Abstract: Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and flawed reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we seek to use symbolic programs as a means of automated evaluation, testing whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. For those executable programs verified using the original input-output pairs, we find that they encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
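To make the pipeline concrete, here is a minimal sketch of the extract-verify-perturb loop the abstract describes. The sample question, function name, and perturbation ranges are our own illustrative assumptions, not the authors' implementation:
```python
# Minimal sketch of a ReasonAgain-style evaluation loop (our illustration,
# not the authors' released code). A GSM8K-style question is assumed:
# "Alice has 5 apples. She buys 3 bags with 4 apples each. How many apples
# does she have now?" An LLM such as GPT-4o would be prompted to extract the
# underlying reasoning as a parameterized symbolic program like this one:
import random

def extracted_program(start_apples: int, num_bags: int, apples_per_bag: int) -> int:
    """Symbolic program capturing the reasoning behind the original question."""
    return start_apples + num_bags * apples_per_bag

# Step 1: keep the program only if it reproduces the original answer.
original_inputs = {"start_apples": 5, "num_bags": 3, "apples_per_bag": 4}
assert extracted_program(**original_inputs) == 17

# Step 2: perturb the inputs; re-executing the program yields gold answers
# for new question variants that the evaluated LLM must then solve.
random.seed(0)
for _ in range(3):
    inputs = {k: random.randint(2, 20) for k in original_inputs}
    gold = extracted_program(**inputs)
    question = (
        f"Alice has {inputs['start_apples']} apples. She buys "
        f"{inputs['num_bags']} bags with {inputs['apples_per_bag']} apples "
        f"each. How many apples does she have now?"
    )
    print(question, "->", gold)
```
The key design point is that gold answers for the new variants come from re-executing the verified program, so no additional human annotation is needed.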
Related papers
- PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics [21.23509339665165]
We propose a prompt optimization approach that uses a smaller, fine-tuned language model to compress input data for the evaluation prompt.
Our results show a 2.37× reduction in token usage without any loss in evaluation quality.
arXiv Detail & Related papers (2024-12-20T18:08:02Z)
- In Context Learning and Reasoning for Symbolic Regression with Large Language Models [0.0]
Large Language Models (LLMs) are transformer-based machine learning models.
We show how GPT-4 can perform symbolic regression, recovering equations from datasets.
This approach does not outperform established SR programs where target equations are more complex.
arXiv Detail & Related papers (2024-10-22T21:50:52Z)
- OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling [62.19438812624467]
Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning.
We propose OptiBench, a benchmark for end-to-end optimization problem solving with human-readable inputs and outputs.
arXiv Detail & Related papers (2024-07-13T13:27:57Z)
- MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit [4.957099360745168]
Large language models (LLMs) have been explored in a variety of reasoning tasks, including solving mathematical problems.
We introduce a comprehensive mathematical evaluation toolkit that not only utilizes a Python computer algebra system (CAS) for numerical accuracy, but also integrates an optional LLM (a minimal CAS check is sketched after this entry).
arXiv Detail & Related papers (2024-04-22T07:03:44Z)
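To illustrate the CAS-based check that MARIO Eval's summary mentions, here is a minimal sketch using sympy; the function name and fallback behavior are our assumptions, not the toolkit's actual API:
```python
# Hedged sketch of a CAS-based answer check in the spirit of MARIO Eval
# (illustrative; not the toolkit's actual API). Two answer strings are
# considered equivalent if their difference simplifies to zero.
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(predicted: str, gold: str) -> bool:
    """Return True if two symbolic answers are mathematically equal."""
    try:
        diff = sympy.simplify(parse_expr(predicted) - parse_expr(gold))
        return diff == 0
    except (sympy.SympifyError, SyntaxError):
        # Fall back to exact string comparison when parsing fails.
        return predicted.strip() == gold.strip()

# String forms differ, but the CAS recognizes the equality.
print(answers_equivalent("2*(x + 1)", "2*x + 2"))   # True
print(answers_equivalent("sqrt(8)", "2*sqrt(2)"))   # True
print(answers_equivalent("x + 1", "x + 2"))         # False
```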
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
- Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach, namely DELI, that can deliberate over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer (a minimal solver-based sketch follows this entry).
We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
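As a hedged illustration of the declarative paradigm SatLM describes: the model emits constraints rather than step-by-step code, and a solver derives the answer. The sketch below uses Z3 as a stand-in for the off-the-shelf reasoner; the word problem and variable names are ours, not the paper's:
```python
# Hedged illustration of the SatLM idea: instead of an imperative program,
# the LLM emits a declarative specification (constraints), and an off-the-
# shelf solver derives the answer. Z3 is used here for illustration.
from z3 import Int, Solver, sat

# Word problem: "Alice and Bob have 30 coins together. Alice has twice as
# many coins as Bob. How many coins does Alice have?"
alice, bob = Int("alice"), Int("bob")

s = Solver()
s.add(alice + bob == 30)   # constraint: total coins
s.add(alice == 2 * bob)    # constraint: Alice has twice Bob's coins
s.add(alice >= 0, bob >= 0)

if s.check() == sat:
    model = s.model()
    print("Alice has", model[alice], "coins")  # -> Alice has 20 coins
```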
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks [108.4568236569645]
Chain-of-thought prompting (CoT) is by far the state-of-the-art method for such numerical reasoning tasks.
We propose 'Program of Thoughts' (PoT), which uses language models to express the reasoning process as a program (a minimal sketch follows this entry).
PoT shows an average performance gain of around 12% over CoT across all the evaluated datasets.
arXiv Detail & Related papers (2022-11-22T21:06:00Z)
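As a concrete instance of the PoT idea, a minimal sketch (our own example problem, not one from the paper): the model writes the reasoning as a short program and delegates the arithmetic to the Python interpreter:
```python
# Hedged sketch of Program of Thoughts (PoT): the LLM emits the reasoning
# as a program, and a Python interpreter performs the computation.
# The word problem and variable names are illustrative.
#
# Problem: "A store sells pens at $1.50 each. Pens are 20% off today.
# How much do 8 pens cost?"
#
# A PoT-style model response would be code like the following:
price_per_pen = 1.50
discount = 0.20
num_pens = 8

discounted_price = price_per_pen * (1 - discount)  # $1.20 per pen
total_cost = discounted_price * num_pens           # arithmetic done by Python

print(f"${total_cost:.2f}")  # -> $9.60
```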