A Causal Framework to Quantify the Robustness of Mathematical Reasoning
with Language Models
- URL: http://arxiv.org/abs/2210.12023v3
- Date: Wed, 7 Jun 2023 22:03:16 GMT
- Title: A Causal Framework to Quantify the Robustness of Mathematical Reasoning
with Language Models
- Authors: Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf
and Mrinmaya Sachan
- Abstract summary: We study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space.
Our analysis shows that robustness does not appear to continuously improve as a function of size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
- Score: 81.15974174627785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We have recently witnessed a number of impressive results on hard
mathematical reasoning problems with language models. At the same time, the
robustness of these models has also been called into question; recent works
have shown that models can rely on shallow patterns in the problem description
when generating a solution. Building on the idea of behavioral testing, we
propose a novel framework, which pins down the causal effect of various factors
in the input, e.g., the surface form of the problem text, the operands, and
math operators on the output solution. By grounding the behavioral analysis in
a causal graph describing an intuitive reasoning process, we study the behavior
of language models in terms of robustness and sensitivity to direct
interventions in the input space. We apply our framework to a test bed of math
word problems. Our analysis shows that robustness does not appear to
continuously improve as a function of size, but the GPT-3 Davinci models (175B)
achieve a dramatic improvement in both robustness and sensitivity compared to
all other GPT variants.
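To make the intervention idea concrete, the Python sketch below shows the kind of behavioral probe the abstract describes: intervene on one input factor at a time (the operands or the surface form of a math word problem) and measure how often the model's answer changes. This is a minimal sketch, not the authors' code; the templates, the `query_model` stub, and the change-rate metrics are illustrative assumptions (the paper grounds these quantities in a causal graph and measures effects on the model's answer distribution rather than single predictions).

```python
# Minimal sketch of an intervention-style robustness/sensitivity probe.
# Not the authors' code: templates, query_model stub, and metrics are illustrative.
import random
import re

# Two hypothetical surface forms of the same one-step addition problem.
TEMPLATES = [
    "Alice has {a} apples and buys {b} more apples. How many apples does she have now?",
    "A basket contains {a} apples and {b} apples are added to it. How many apples are in the basket?",
]

def query_model(prompt: str) -> int:
    """Stand-in for a language-model call; here an oracle that just sums the numbers in the prompt."""
    return sum(int(x) for x in re.findall(r"\d+", prompt))

def probe(n_samples: int = 200, seed: int = 0) -> dict:
    rng = random.Random(seed)
    operand_changes = surface_changes = 0
    for _ in range(n_samples):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        base_answer = query_model(TEMPLATES[0].format(a=a, b=b))
        # Intervention on the operands (same surface form): the ground truth changes,
        # so a sensitive model should change its answer.
        a2, b2 = rng.randint(2, 99), rng.randint(2, 99)
        while a2 + b2 == a + b:  # ensure the ground-truth result actually differs
            a2, b2 = rng.randint(2, 99), rng.randint(2, 99)
        operand_changes += query_model(TEMPLATES[0].format(a=a2, b=b2)) != base_answer
        # Intervention on the surface form (same operands): the ground truth is unchanged,
        # so a robust model should keep its answer.
        surface_changes += query_model(TEMPLATES[1].format(a=a, b=b)) != base_answer
    return {
        "sensitivity_to_operands": operand_changes / n_samples,      # higher is better
        "brittleness_to_surface_form": surface_changes / n_samples,  # lower is better
    }

if __name__ == "__main__":
    print(probe())  # the oracle stub scores 1.0 sensitivity and 0.0 brittleness
```

With a real model behind `query_model`, high sensitivity to operand interventions together with low brittleness to surface-form interventions is the desirable behavior the paper's metrics are designed to capture.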
Related papers
- Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems [0.0]
This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks.
We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism.
By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic.
arXiv Detail & Related papers (2024-08-29T08:26:42Z)
- Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models [33.91763946767206]
We study the relationship between the surface form of a mathematical problem and its solvability by large language models.
We propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem.
arXiv Detail & Related papers (2024-04-17T15:53:49Z)
- Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations.
We introduce a methodology designed to examine how input perturbations affect language models across various scales.
We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
arXiv Detail & Related papers (2023-11-15T02:59:10Z)
- Large Language Models as Analogical Reasoners [155.9617224350088]
Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks.
We introduce a new prompting approach, analogical prompting, designed to automatically guide the reasoning process of large language models.
arXiv Detail & Related papers (2023-10-03T00:57:26Z)
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? [75.79305790453654]
Coaxing desired behaviors out of pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z)
- Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks [0.8889304968879164]
We apply a pre-trained language model to constrained arithmetic problems with hierarchical structure and analyze its attention weights and hidden states.
The investigation reveals promising results, with the model addressing hierarchical problems in a moderately structured manner, similar to human problem-solving strategies.
The attention analysis allows us to hypothesize that the model can generalize to longer sequences in the ListOps dataset, a conclusion later confirmed through testing on sequences longer than those in the training set.
arXiv Detail & Related papers (2023-06-21T11:48:07Z)
- Logically Consistent Adversarial Attacks for Soft Theorem Provers [110.17147570572939]
We propose a generative adversarial framework for probing and improving language models' reasoning capabilities.
Our framework successfully generates adversarial attacks and identifies global weaknesses.
In addition to effective probing, we show that training on the generated samples improves the target model's performance.
arXiv Detail & Related papers (2022-04-29T19:10:12Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks (a minimal prompt sketch is included after this list).
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
- SMART: A Situation Model for Algebra Story Problems via Attributed Grammar [74.1315776256292]
We introduce the concept of a "situation model", which originates from psychology studies to represent the mental states of humans in problem-solving.
We show that the proposed model outperforms all previous neural solvers by a large margin while preserving much better interpretability.
arXiv Detail & Related papers (2020-12-27T21:03:40Z)
- CausaLM: Causal Model Explanation Through Counterfactual Language Models [33.29636213961804]
CausaLM is a framework for producing causal model explanations using counterfactual language representation models.
We show that language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest.
A byproduct of our method is a language representation model that is unaffected by the tested concept.
arXiv Detail & Related papers (2020-05-27T15:06:35Z)
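As a concrete illustration of the chain-of-thought prompting referenced in the entry "Chain of Thought Prompting Elicits Reasoning in Large Language Models" above, here is a minimal sketch. The worked exemplar is the canonical tennis-ball example from that line of work; `make_cot_prompt` is a hypothetical helper, not an API from the cited paper.

```python
# Minimal sketch of a chain-of-thought prompt for a math word problem.
# The worked exemplar spells out intermediate steps so the model is nudged to
# reason step by step before giving its final answer.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def make_cot_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model imitates the step-by-step format."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

if __name__ == "__main__":
    print(make_cot_prompt("A baker makes 4 trays of 12 muffins and sells 20. How many are left?"))
```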