Evaluating Mathematical Reasoning Beyond Accuracy
- URL: http://arxiv.org/abs/2404.05692v1
- Date: Mon, 8 Apr 2024 17:18:04 GMT
- Title: Evaluating Mathematical Reasoning Beyond Accuracy
- Authors: Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu,
- Abstract summary: We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
- Score: 50.09931172314218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs $\textit{validity}$ and $\textit{redundancy}$ to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. Instantiated by base models that possess strong mathematical knowledge and trained with high-quality labeled data, ReasonEval achieves state-of-the-art performance on human-labeled datasets and can accurately detect different types of errors generated by perturbation. When applied to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We release the best-performing model, meta-evaluation script, and all evaluation results at https://github.com/GAIR-NLP/ReasonEval.
Related papers
- Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [46.670206614087334]
How to comprehensively define and evaluate the mathematical abilities of large language models (LLMs) has emerged as a critical issue.
We introduce MATHCHECK, a well-designed checklist for testing task generalization and reasoning robustness.
We adopt MATHCHECK-GSM and MATHCHECK-GEO to evaluate over 20 LLMs and 11 MLLMs, assessing their comprehensive mathematical reasoning abilities.
arXiv Detail & Related papers (2024-07-11T17:58:58Z) - AlphaMath Almost Zero: process Supervision without process [6.318873143509028]
Large language models (LLMs) struggle with complex problems that require multiple reasoning steps.
We introduce an innovative approach that bypasses the need for process annotations (from human or GPTs) by utilizing the Monte Carlo Tree Search (MCTS) framework.
Our method iteratively trains the policy and value models, leveraging the capabilities of a well-pretrained LLM.
arXiv Detail & Related papers (2024-05-06T15:20:30Z) - Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
textitSelf-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets.
We train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning.
arXiv Detail & Related papers (2024-04-10T14:05:44Z) - MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Estimating Fr\'echet bounds for validating programmatic weak supervision [50.13475056199486]
We develop methods for estimating Fr'eche's bounds on (possibly high-dimensional) distribution classes in which some variables are continuous-valued.
We demonstrate the usefulness of our algorithms by evaluating the performance of machine learning (ML) models trained with programmatic weak supervision (PWS)
arXiv Detail & Related papers (2023-12-07T07:15:11Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROS, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Model Optimization in Imbalanced Regression [2.580765958706854]
Imbalanced domain learning aims to produce accurate models in predicting instances that, though underrepresented, are of utmost importance for the domain.
One of the main reasons for this is the lack of loss functions capable of focusing on minimizing the errors of extreme (rare) values.
Recently, an evaluation metric was introduced: Squared Error Relevance Area (SERA)
This metric posits a bigger emphasis on the errors committed at extreme values while also accounting for the performance in the overall target variable domain.
arXiv Detail & Related papers (2022-06-20T20:23:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.