NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning
Tasks
- URL: http://arxiv.org/abs/2204.05660v1
- Date: Tue, 12 Apr 2022 09:36:10 GMT
- Title: NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning
Tasks
- Authors: Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva,
Peter Clark, Chitta Baral and Ashwin Kalyan
- Abstract summary: We propose NumGLUE, a benchmark that evaluates the performance of AI systems on eight different tasks.
We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models.
We hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language.
- Score: 37.730939229638224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the ubiquitous nature of numbers in text, reasoning with numbers to
perform simple calculations is an important skill of AI systems. While many
datasets and models have been developed to this end, state-of-the-art AI
systems are brittle, failing to perform the underlying mathematical reasoning
when problems appear in a slightly different scenario. Drawing inspiration from
GLUE that was proposed in the context of natural language understanding, we
propose NumGLUE, a multi-task benchmark that evaluates the performance of AI
systems on eight different tasks that, at their core, require simple arithmetic
understanding. We show that this benchmark is far from being solved, with
neural models, including state-of-the-art large-scale language models,
performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes
sharing knowledge across tasks, especially those with limited training data as
evidenced by the superior performance (average gain of 3.4% on each task) when
a model is jointly trained on all the tasks as opposed to task-specific
modeling. Finally, we hope that NumGLUE will encourage systems that perform
robust and general arithmetic reasoning within language, a first step towards
being able to perform more complex mathematical reasoning.
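To make the joint-training result above concrete, here is a minimal sketch of what multi-task training over the eight tasks could look like: examples from every task are pooled into one shuffled stream and fed to a single shared model, rather than fitting one model per task. All names here (`multitask_stream`, the task labels, the toy data) are illustrative assumptions, not the authors' actual training code.

```python
import random

def multitask_stream(task_datasets, seed=0):
    """Interleave (input, target) examples across tasks.

    task_datasets: dict mapping task name -> list of (question, answer).
    Prefixing each input with its task name lets a single text-to-text
    model share arithmetic knowledge across tasks, which helps tasks
    with limited training data.
    """
    rng = random.Random(seed)
    pool = [
        (f"{task}: {question}", answer)
        for task, examples in task_datasets.items()
        for question, answer in examples
    ]
    rng.shuffle(pool)  # mix tasks so each batch spans several of them
    yield from pool

# Toy data standing in for two of the eight tasks:
toy = {
    "arithmetic_word_problems": [
        ("John had 3 apples and bought 2 more. How many now?", "5"),
    ],
    "quantitative_nli": [
        ("Premise: x = 4. Hypothesis: x > 3.", "entailment"),
    ],
}
for model_input, target in multitask_stream(toy):
    print(model_input, "->", target)  # feed these to one shared model
```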
Related papers
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- FERMAT: An Alternative to Accuracy for Numerical Reasoning [11.893004722079557]
Existing datasets measure numerical reasoning using a single score.
We introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT.
FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
arXiv Detail & Related papers (2023-05-27T15:00:45Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
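The Lila entry above notes that solutions are collected as Python programs. A toy illustration of what such a program-annotated instance might look like follows; the field names are assumptions, not Lila's actual schema.

```python
# Illustrative only: a Lila-style instance pairs a natural-language task
# with a Python program whose output is the answer.
instance = {
    "instruction": "A train travels 120 km in 2 hours. Find its speed in km/h.",
    "program": "distance_km = 120\nhours = 2\nprint(distance_km / hours)",
    "answer": "60.0",
}

# Executing the program reproduces the answer, making solutions checkable.
exec(instance["program"])  # prints 60.0
```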
- Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems [42.782260686177395]
We propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models.
We first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models.
We transform and formulate the task as an analytically solvable linear system.
arXiv Detail & Related papers (2022-10-11T00:57:19Z)
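The anchoring idea in the entry above can be sketched concretely: if a model's answers to a two-number question behave like a linear form w1*x1 + w2*x2 + b, then probing with a few simple anchor inputs yields a linear system whose analytic solution recovers the implied expression. This is a minimal sketch under that assumption; `query_model` is a placeholder, not a real API.

```python
import numpy as np

def query_model(x1, x2):
    # Placeholder for an actual LM call: pretend the model consistently
    # computes x1 - 2*x2 + 5 for this question template.
    return x1 - 2 * x2 + 5

# Simple anchor numbers used as probe inputs.
anchors = [(0, 0), (1, 0), (0, 1)]
A = np.array([[x1, x2, 1.0] for x1, x2 in anchors])   # rows: [x1, x2, 1]
y = np.array([query_model(x1, x2) for x1, x2 in anchors])

# Analytically solve A @ [w1, w2, b] = y to recover the implied expression.
w1, w2, b = np.linalg.solve(A, y)
print(w1, w2, b)  # -> 1.0 -2.0 5.0, i.e. f(x1, x2) = x1 - 2*x2 + 5
```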
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
- SMART: A Situation Model for Algebra Story Problems via Attributed Grammar [74.1315776256292]
We introduce the concept of a situation model, which originates from psychology studies of the mental states of humans in problem-solving.
We show that the proposed model outperforms all previous neural solvers by a large margin while preserving much better interpretability.
arXiv Detail & Related papers (2020-12-27T21:03:40Z)
- Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)
- Injecting Numerical Reasoning Skills into Language Models [41.78745615537762]
High-level reasoning skills, such as numerical reasoning, are difficult to learn from a language-modeling objective only.
We show that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs.
We show that our model, GenBERT, dramatically improves performance on DROP (49.3 → 72.3 F1), matching state-of-the-art models of comparable size.
arXiv Detail & Related papers (2020-04-09T11:14:56Z)
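The automatic-data-generation idea in the GenBERT entry can be sketched as follows: synthesize large numbers of arithmetic question-answer pairs from templates and mix them into pre-training. The templates below are illustrative assumptions, not the generator the paper actually uses.

```python
import random

def generate_examples(n, seed=0):
    """Yield synthetic (question, answer) pairs for numerical pre-training."""
    rng = random.Random(seed)
    ops = {
        "plus": lambda a, b: a + b,
        "minus": lambda a, b: a - b,
        "times": lambda a, b: a * b,
    }
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        name, fn = rng.choice(list(ops.items()))
        yield f"What is {a} {name} {b}?", str(fn(a, b))

for question, answer in generate_examples(3):
    print(question, answer)  # extra pre-training data for the LM
```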