FERMAT: An Alternative to Accuracy for Numerical Reasoning
- URL: http://arxiv.org/abs/2305.17491v1
- Date: Sat, 27 May 2023 15:00:45 GMT
- Title: FERMAT: An Alternative to Accuracy for Numerical Reasoning
- Authors: Jasivan Alex Sivakumar and Nafise Sadat Moosavi
- Abstract summary: Numerical reasoning is measured using a single score on existing datasets.
We introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT.
FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
- Score: 11.893004722079557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While pre-trained language models achieve impressive performance on various
NLP benchmarks, they still struggle with tasks that require numerical
reasoning. Recent advances in improving numerical reasoning are mostly achieved
using very large language models that contain billions of parameters and are
not accessible to everyone. In addition, numerical reasoning is measured using
a single score on existing datasets. As a result, we do not have a clear
understanding of the strengths and shortcomings of existing models on different
numerical reasoning aspects and therefore, potential ways to improve them apart
from scaling them up. Inspired by CheckList (Ribeiro et al., 2020), we
introduce a multi-view evaluation set for numerical reasoning in English,
called FERMAT. Instead of reporting a single score on a whole dataset, FERMAT
evaluates models on various key numerical reasoning aspects such as number
understanding, mathematical operations, and training dependency. Apart from
providing a comprehensive evaluation of models on different numerical reasoning
aspects, FERMAT enables a systematic and automated generation of an arbitrarily
large training or evaluation set for each aspect. The datasets and codes are
publicly available to generate further multi-view data for other tasks and
languages.
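As a rough, hypothetical illustration of the "systematic and automated generation of an arbitrarily large training or evaluation set for each aspect" mentioned above, the sketch below instantiates a single addition template along a few invented aspects (small integers, large integers, decimals). The template, aspect names, and helper functions are assumptions for illustration; they are not the released FERMAT code.

```python
# Hypothetical sketch of aspect-wise data generation in the spirit of FERMAT.
# The template, aspect names, and answer logic are illustrative assumptions,
# not taken from the released FERMAT resources.
import random

TEMPLATE = ("A shop sold {a} items in the morning and {b} items in the "
            "afternoon. How many items were sold in total?")

def make_instance(a, b):
    """Render one question/answer pair from the addition template."""
    return {"question": TEMPLATE.format(a=a, b=b), "answer": a + b}

def generate(aspect, n=5, seed=0):
    """Generate n instances whose operands probe a single numerical aspect."""
    rng = random.Random(seed)
    instances = []
    for _ in range(n):
        if aspect == "small_integers":        # e.g. number understanding on 0-99
            a, b = rng.randint(0, 99), rng.randint(0, 99)
        elif aspect == "large_integers":      # 4-5 digit operands
            a, b = rng.randint(1000, 99999), rng.randint(1000, 99999)
        elif aspect == "decimals":            # operands with two decimal places
            a, b = round(rng.uniform(0, 100), 2), round(rng.uniform(0, 100), 2)
        else:
            raise ValueError(f"unknown aspect: {aspect}")
        instances.append(make_instance(a, b))
    return instances

if __name__ == "__main__":
    for aspect in ["small_integers", "large_integers", "decimals"]:
        print(aspect, generate(aspect, n=2))
```

Reporting accuracy separately on such per-aspect sets, rather than a single aggregate score, is the multi-view evaluation the abstract describes.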
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Exploring the Numerical Reasoning Capabilities of Language Models: A
Comprehensive Analysis on Tabular Data [10.124148115680315]
We propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels.
We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them.
Our results show that no model consistently excels across all numerical reasoning types.
arXiv Detail & Related papers (2023-11-03T20:05:30Z) - Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 datasets with task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z) - Reflection of Thought: Inversely Eliciting Numerical Reasoning in
Language Models via Solving Linear Systems [42.782260686177395]
We propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models.
We first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models.
We transform and formulate the task as an analytically solvable linear system (a toy sketch of this anchor-based probing appears after this list).
arXiv Detail & Related papers (2022-10-11T00:57:19Z) - NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning
Tasks [37.730939229638224]
We propose NumGLUE, a benchmark that evaluates the performance of AI systems on eight different tasks.
We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models.
We hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language.
arXiv Detail & Related papers (2022-04-12T09:36:10Z) - RuMedBench: A Russian Medical Language Understanding Benchmark [58.99199480170909]
The paper describes the open Russian medical language understanding benchmark covering several task types.
We prepare the unified format labeling, data split, and evaluation metrics for new tasks.
A single-number metric expresses a model's ability to cope with the benchmark.
arXiv Detail & Related papers (2022-01-17T16:23:33Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - NumGPT: Improving Numeracy Ability of Generative Pre-trained Models [59.931394234642816]
We propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts.
Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of a number and an individual embedding to encode its exponent (a toy mantissa/exponent decomposition is sketched after this list).
A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT.
arXiv Detail & Related papers (2021-09-07T15:06:12Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - Towards Question Format Independent Numerical Reasoning: A Set of
Prerequisite Tasks [23.72187153601608]
We introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats.
Two of the new types we add involve questions that require external numerical knowledge, commonsense knowledge, and domain knowledge.
For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning.
arXiv Detail & Related papers (2020-05-18T08:14:04Z)
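As referenced in the "Reflection of Thought" entry above, here is a toy sketch of anchor-based probing: assume the model's hidden arithmetic on inputs (x1, x2) is roughly linear, query it on simple anchor numbers, and solve the resulting linear system for the coefficients. The `model_predict` function is a hypothetical stand-in for querying a language model; the paper's actual probing and formulation may differ.

```python
# Toy illustration of anchor-based probing: assume the model's implicit
# arithmetic on inputs (x1, x2) is approximately y = a*x1 + b*x2 + c, query it
# on simple anchor numbers, and solve the linear system for (a, b, c).
# `model_predict` is a hypothetical stand-in, not an API from the paper.
import numpy as np

def model_predict(x1: float, x2: float) -> float:
    # Stand-in for "ask the language model the question with these numbers".
    # Here we secretly use y = 3*x1 - 2*x2 + 5 so the recovery can be checked.
    return 3 * x1 - 2 * x2 + 5

# Three anchor inputs give three equations in the three unknowns a, b, c.
anchors = [(1, 0), (0, 1), (1, 1)]
A = np.array([[x1, x2, 1.0] for x1, x2 in anchors])
y = np.array([model_predict(x1, x2) for x1, x2 in anchors])

a, b, c = np.linalg.solve(A, y)
print(f"recovered expression: y = {a:.1f}*x1 + {b:.1f}*x2 + {c:.1f}")
```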
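Likewise, for the NumGPT entry above, the snippet below only sketches the kind of mantissa/exponent decomposition of a number that a numeral-aware embedding could consume; it assumes plain base-10 scientific notation and does not reproduce the paper's prototype embeddings or loss function.

```python
# Minimal sketch of splitting a number into mantissa and exponent, the kind of
# decomposition a numeral-aware embedding (as in NumGPT) could operate on.
# Plain base-10 scientific notation is an assumption made for illustration;
# the paper's exact parameterisation may differ.
import math

def mantissa_exponent(value: float):
    """Return (mantissa, exponent) with value == mantissa * 10**exponent and 1 <= |mantissa| < 10."""
    if value == 0:
        return 0.0, 0
    exponent = math.floor(math.log10(abs(value)))
    mantissa = value / (10 ** exponent)
    return mantissa, exponent

for x in [3.75, 42000, 0.0081]:
    m, e = mantissa_exponent(x)
    print(f"{x} -> mantissa {m:.4f}, exponent {e}")
```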
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.