Towards Question Format Independent Numerical Reasoning: A Set of
Prerequisite Tasks
- URL: http://arxiv.org/abs/2005.08516v1
- Date: Mon, 18 May 2020 08:14:04 GMT
- Title: Towards Question Format Independent Numerical Reasoning: A Set of
Prerequisite Tasks
- Authors: Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva and
Chitta Baral
- Abstract summary: We introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats.
Two of the newly added question types concern questions that require external numerical knowledge, commonsense knowledge, and domain knowledge.
For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerical reasoning is often important to accurately understand the world.
Recently, several format-specific datasets have been proposed, such as
numerical reasoning in the settings of Natural Language Inference (NLI),
Reading Comprehension (RC), and Question Answering (QA). Several
format-specific models and architectures in response to those datasets have
also been proposed. However, there exists a strong need for a benchmark which
can evaluate the abilities of models in performing question format independent
numerical reasoning, as (i) the numerical reasoning capabilities we want to
teach are not controlled by question formats, (ii) for numerical reasoning
technology to have the best possible application, it must be able to process
language and reason in a way that is not exclusive to a single format, task,
dataset or domain. In pursuit of this goal, we introduce NUMBERGAME, a
multifaceted benchmark to evaluate model performance across numerical reasoning
tasks of eight diverse formats. We include four existing question types in our
compilation; the other four formats are new. Two of the new types concern
questions that require external numerical knowledge, commonsense knowledge,
and domain knowledge. For
building a more practical numerical reasoning system, NUMBERGAME demands four
capabilities beyond numerical reasoning: (i) detecting the question format
directly from data, (ii) finding an intermediate common format to which every
format can be converted, (iii) incorporating commonsense knowledge, and
(iv) handling data imbalance across formats. We build several baselines,
including a new model based on knowledge hunting using a cheatsheet. However,
all baselines perform poorly compared to the human baselines, indicating the
hardness of our benchmark. Our work extends recent progress in generic system
development, demonstrating the scope of these under-explored tasks.
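
To make capability (ii) concrete, the sketch below shows one way an intermediate common format could look, with NLI- and RC-style numerical questions mapped into it. All class, field, and function names are hypothetical illustrations, not from the paper.

```python
from dataclasses import dataclass, field

# A minimal sketch of an intermediate common format (capability (ii)).
# Every name here is hypothetical; the paper does not prescribe this schema.

@dataclass
class CommonQuestion:
    context: str    # passage, premise, or "" for standalone questions
    question: str   # the numerical question to answer
    options: list = field(default_factory=list)  # candidates; empty if open-ended
    answer: str = ""                             # gold answer as a string

def from_nli(premise: str, hypothesis: str, label: str) -> CommonQuestion:
    # NLI pairs become yes/no questions about the hypothesis.
    return CommonQuestion(
        context=premise,
        question=f"Is it true that {hypothesis}?",
        options=["yes", "no"],
        answer="yes" if label == "entailment" else "no",
    )

def from_rc(passage: str, question: str, answer: str) -> CommonQuestion:
    # Reading-comprehension examples map over almost directly.
    return CommonQuestion(context=passage, question=question, answer=answer)

q = from_nli("The vase holds 3 red and 4 blue flowers.",
             "the vase holds 7 flowers", "entailment")
print(q.question, "->", q.answer)
# Is it true that the vase holds 7 flowers? -> yes
```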
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
- FormulaReasoning: A Dataset for Formula-Based Numerical Reasoning [14.0148122484585]
We construct a dataset for formula-based numerical reasoning called FormulaReasoning, which consists of 5,420 reasoning-based questions.
We employ it to evaluate LLMs ranging in size from 7B to over 100B parameters, using zero-shot and few-shot chain-of-thought methods.
We also explore using retrieval-augmented LLMs provided with an external formula database associated with our dataset.
arXiv Detail & Related papers (2024-02-20T03:39:49Z)
- Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data [10.124148115680315]
We propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels.
We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them.
Our results show that no model consistently excels across all numerical reasoning types.
arXiv Detail & Related papers (2023-11-03T20:05:30Z)
- FERMAT: An Alternative to Accuracy for Numerical Reasoning [11.893004722079557]
Numerical reasoning is measured using a single score on existing datasets.
We introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT.
FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
arXiv Detail & Related papers (2023-05-27T15:00:45Z)
- STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [56.555662318619135]
We introduce a unified multi-task and multi-domain natural language reasoning and explanation benchmark.
We expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer.
arXiv Detail & Related papers (2023-02-13T22:34:02Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 datasets with task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
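
A small invented example of a word problem with its solution expressed as a Python program, in the general style the summary describes (neither the problem nor the program is drawn from LILA):

```python
# Hypothetical task instruction:
# "Rita read 12 pages on Monday and twice as many on Tuesday.
#  How many pages did she read in total?"

def solution() -> int:
    monday = 12
    tuesday = 2 * monday  # "twice as many"
    return monday + tuesday

print(solution())  # -> 36
```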
- Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems [42.782260686177395]
We propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models.
We first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models.
We transform and formulate the task as an analytically solvable linear system.
arXiv Detail & Related papers (2022-10-11T00:57:19Z)
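
To illustrate the anchor-probing idea: if the model's numeric answers are assumed to follow a linear expression over the numbers in a question, querying it with a few simple anchor inputs produces a linear system whose solution recovers the expression. The model function below is a hypothetical stand-in for a language model, not the authors' setup.

```python
import numpy as np

def model(a: float, b: float) -> float:
    # Stand-in for a language model's numeric answer; it secretly
    # computes a - 2*b + 3 so the recovered weights can be checked.
    return a - 2 * b + 3

# Probe with three simple anchors (three equations for three unknowns).
anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
A = np.array([[a, b, 1.0] for a, b in anchors])  # design matrix [a, b, bias]
y = np.array([model(a, b) for a, b in anchors])  # observed model outputs

w = np.linalg.solve(A, y)  # analytically solve the linear system
print(f"recovered: y = {w[0]:g}*a + {w[1]:g}*b + {w[2]:g}")
# -> recovered: y = 1*a + -2*b + 3
```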
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
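
As a rough illustration of this decomposition idea, the sketch below routes factoid sub-questions to a stand-in QA function and the numeric step to a symbolic calculator. Both helpers are hypothetical toys; ModularQA's actual sub-question generator and QA model are neural.

```python
import operator

def factoid_qa(question: str, context: dict) -> str:
    # Stand-in for a neural single-span QA model: a simple lookup here.
    return context[question]

def calculator(op: str, x: float, y: float) -> float:
    # Symbolic calculator module for the numeric step.
    return {"add": operator.add, "sub": operator.sub}[op](x, y)

# Multi-hop question: "How many more points did Team A score than Team B?"
# decomposed into two factoid sub-questions plus one calculator call.
context = {
    "How many points did Team A score?": "31",
    "How many points did Team B score?": "17",
}
a = float(factoid_qa("How many points did Team A score?", context))
b = float(factoid_qa("How many points did Team B score?", context))
print(calculator("sub", a, b))  # -> 14.0
```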
This list is automatically generated from the titles and abstracts of the papers on this site.