Exploring the Numerical Reasoning Capabilities of Language Models: A
Comprehensive Analysis on Tabular Data
- URL: http://arxiv.org/abs/2311.02216v1
- Date: Fri, 3 Nov 2023 20:05:30 GMT
- Title: Exploring the Numerical Reasoning Capabilities of Language Models: A
Comprehensive Analysis on Tabular Data
- Authors: Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil,
Oana Cocarascu, Elena Simperl
- Abstract summary: We propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels.
We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them.
Our results show that no model consistently excels across all numerical reasoning types.
- Score: 10.124148115680315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numbers are crucial for various real-world domains such as finance,
economics, and science. Thus, understanding and reasoning with numbers are
essential skills for language models to solve a variety of tasks. While several
numerical benchmarks have been introduced in recent years, most are limited to
specific numerical aspects. In this paper, we propose a hierarchical
taxonomy for numerical reasoning skills with more than ten reasoning types
across four levels: representation, number sense, manipulation, and complex
reasoning. We conduct a comprehensive evaluation of state-of-the-art models to
identify reasoning challenges specific to them. To this end, we develop a
diverse set of numerical probes using a semi-automated approach. We focus
on the tabular Natural Language Inference (TNLI) task as a case study and
measure models' performance shifts. Our results show that no model consistently
excels across all numerical reasoning types. Among the probed models, FlanT5
(few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical
reasoning skills compared to other models. Label-flipping probes indicate that
models often exploit dataset artifacts to predict the correct labels.
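To make the probing idea concrete, below is a minimal, hypothetical sketch of a label-flipping probe for tabular NLI. The table, hypotheses, and predict function are illustrative stand-ins, not the authors' implementation.

```python
import re

# Toy premise table and hypothesis pair for a tabular NLI (TNLI) probe.
table = {"Revenue 2022": 120, "Revenue 2021": 100}
hypothesis = "Revenue grew by 20 in 2022."  # gold label: ENTAILED
flipped = "Revenue grew by 50 in 2022."     # number perturbed; gold flips to REFUTED

def predict(table: dict, hypothesis: str) -> str:
    # Hypothetical stand-in for the TNLI model under evaluation; here a toy
    # rule that actually checks the claimed growth arithmetically.
    claimed = int(re.search(r"grew by (\d+)", hypothesis).group(1))
    actual = table["Revenue 2022"] - table["Revenue 2021"]
    return "ENTAILED" if claimed == actual else "REFUTED"

# If a model keeps its prediction after the gold label flips, it likely relied
# on dataset artifacts rather than on the number itself.
artifact_suspected = predict(table, hypothesis) == predict(table, flipped)
print(artifact_suspected)  # False for this toy model, which reads the numbers
```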
Related papers
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, and finding categories where one language model outperforms another.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Exploring Internal Numeracy in Language Models: A Case Study on ALBERT [12.431248361369466]
We propose a method for studying how Transformer-based language models internally represent numerical data.
We extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals.
Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
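As a rough illustration of this kind of embedding analysis (not the authors' exact procedure), one could pull number-token embeddings from a pretrained ALBERT checkpoint via Hugging Face transformers:

```python
# Sketch: extract learned input embeddings for numerals from ALBERT and compare
# neighbouring numbers; illustrative of the approach, not the paper's setup.
import torch
import torch.nn.functional as F
from transformers import AlbertModel, AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
emb = model.get_input_embeddings()  # token embedding matrix (an nn.Embedding)

def number_vector(n: int) -> torch.Tensor:
    # Average sub-token embeddings in case the numeral splits into pieces.
    ids = tokenizer(str(n), add_special_tokens=False)["input_ids"]
    return emb(torch.tensor(ids)).mean(dim=0)

# One simple probe of ordinal structure: nearby numbers should embed closer.
v3, v4, v90 = number_vector(3), number_vector(4), number_vector(90)
print(F.cosine_similarity(v3, v4, dim=0) > F.cosine_similarity(v3, v90, dim=0))
```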
arXiv Detail & Related papers (2024-04-25T12:36:19Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life
Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- FERMAT: An Alternative to Accuracy for Numerical Reasoning [11.893004722079557]
Numerical reasoning is typically measured with a single aggregate score on existing datasets.
We introduce FERMAT, a multi-view evaluation set for numerical reasoning in English.
FERMAT evaluates models on key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency.
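A minimal sketch of what multi-view scoring looks like in practice; the aspect names and record layout below are assumptions, not FERMAT's actual schema:

```python
from collections import defaultdict

# Hypothetical per-instance results tagged with a reasoning aspect.
results = [
    {"aspect": "number_understanding", "correct": True},
    {"aspect": "mathematical_operations", "correct": False},
    {"aspect": "mathematical_operations", "correct": True},
    {"aspect": "training_dependency", "correct": False},
]

# Report one accuracy per aspect instead of a single aggregate score.
totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["aspect"]] += 1
    hits[r["aspect"]] += int(r["correct"])
for aspect, n in totals.items():
    print(f"{aspect}: {hits[aspect] / n:.2f}")
```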
arXiv Detail & Related papers (2023-05-27T15:00:45Z)
- Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems [42.782260686177395]
We propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models.
We first leverage simple numbers as anchors to probe the implicitly inferred arithmetic expressions from language models.
We transform and formulate the task as an analytically solvable linear system.
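To illustrate the anchor-number idea, here is a sketch under the assumption that the model's implicit computation over two inputs is linear; query_model is a hypothetical stand-in for reading a numeric answer out of a language model.

```python
import numpy as np

def query_model(a: float, b: float) -> float:
    # Hypothetical stand-in for querying the language model with numbers a, b
    # and parsing a numeric answer; here the "hidden" expression is 2a + 3b + 1.
    return 2 * a + 3 * b + 1

# Probe with simple anchor numbers, then solve for the implicit coefficients:
# y = w1*a + w2*b + c is fully determined by three independent queries.
anchors = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
A = np.array([[a, b, 1.0] for a, b in anchors])
y = np.array([query_model(a, b) for a, b in anchors])
w1, w2, c = np.linalg.solve(A, y)
print(f"inferred expression: {w1:.0f}*a + {w2:.0f}*b + {c:.0f}")
```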
arXiv Detail & Related papers (2022-10-11T00:57:19Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- NumGPT: Improving Numeracy Ability of Generative Pre-trained Models [59.931394234642816]
We propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts.
Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number.
A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT.
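A hedged sketch of a mantissa/exponent numeral embedding in this spirit; the soft-prototype scheme and dimensions below are assumptions, not NumGPT's exact architecture:

```python
import math
import torch
import torch.nn as nn

def decompose(x: float) -> tuple[float, int]:
    # Base-10 scientific notation: x = m * 10**e with mantissa m in [1, 10).
    if x == 0:
        return 0.0, 0
    e = math.floor(math.log10(abs(x)))
    return x / 10**e, e

class NumeralEmbedding(nn.Module):
    # Hypothetical module: soft prototype-based mantissa encoding plus a
    # learned exponent embedding; NOT the authors' exact design.
    def __init__(self, dim: int, n_protos: int = 10, max_exp: int = 20):
        super().__init__()
        self.protos = nn.Parameter(torch.linspace(1.0, 10.0, n_protos))
        self.proto_emb = nn.Embedding(n_protos, dim)
        self.exp_emb = nn.Embedding(2 * max_exp + 1, dim)
        self.max_exp = max_exp

    def forward(self, x: float) -> torch.Tensor:
        m, e = decompose(x)
        w = torch.softmax(-(self.protos - m) ** 2, dim=0)  # soft assignment
        return w @ self.proto_emb.weight + self.exp_emb(
            torch.tensor(e + self.max_exp)
        )

print(NumeralEmbedding(dim=8)(602.5).shape)  # torch.Size([8])
```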
arXiv Detail & Related papers (2021-09-07T15:06:12Z)
- Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z)
- Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks [23.72187153601608]
We introduce NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats.
Two of the newly added types involve questions that require external knowledge, such as numerical commonsense and domain knowledge.
For building a more practical numerical reasoning system, NUMBERGAME demands four capabilities beyond numerical reasoning.
arXiv Detail & Related papers (2020-05-18T08:14:04Z)