Generating Mathematical Derivations with Large Language Models
- URL: http://arxiv.org/abs/2307.09998v3
- Date: Tue, 8 Aug 2023 12:23:49 GMT
- Title: Generating Mathematical Derivations with Large Language Models
- Authors: Jordan Meadows, Marco Valentino, Andre Freitas
- Abstract summary: We leverage a symbolic engine to generate derivations of equations at scale.
We investigate the capabilities of Large Language Models when deriving goal equations from premises.
- Score: 2.363388546004777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The derivation of mathematical results in specialised fields, using Large
Language Models (LLMs), is an emerging research direction that can help
identify models' limitations, and potentially support mathematical discovery.
In this paper, we leverage a symbolic engine to generate derivations of
equations at scale, and investigate the capabilities of LLMs when deriving goal
equations from premises. Specifically, we employ in-context learning for GPT
and fine-tune a range of T5 models to compare the robustness and generalisation
of pre-training strategies to specialised models. Empirical results show that
fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and
out-of-distribution test sets in conventional scores. However, an in-depth
analysis reveals that the fine-tuned models are more sensitive to perturbations
involving unseen symbols and (to a lesser extent) changes to equation
structure. In addition, we analyse 1.7K equations, and over 200 derivations, to
highlight common reasoning errors such as the inclusion of incorrect,
irrelevant, and redundant equations. Finally, we explore the suitability of
existing metrics for evaluating mathematical derivations and find evidence
that, while they can capture general properties such as sensitivity to
perturbations, they fail to highlight fine-grained reasoning errors and
essential differences between models. Overall, this work demonstrates that
training models on synthetic data may improve their math capabilities beyond
much larger LLMs, but current metrics do not appropriately assess the quality
of generated mathematical text.
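To make the data-generation idea concrete, the following is a minimal sketch of how a symbolic engine can produce premise-to-goal derivations and symbol-level perturbations at scale, assuming SymPy. The premise sampler, operator set, and renaming scheme below are illustrative assumptions, not the paper's exact pipeline.

# Minimal sketch (illustrative, not the paper's exact pipeline): use SymPy to
# generate a premise equation, apply random symbolic operations to reach a goal
# equation, and rename symbols to create an out-of-distribution variant.
import random
import sympy as sp

x = sp.Symbol('x')
f = sp.Function('f')

def random_premise():
    # Sample a simple premise equation f(x) = <sum of two random terms>.
    terms = [x**random.randint(1, 3), sp.sin(x), sp.cos(x), sp.exp(x)]
    return sp.Eq(f(x), sum(random.sample(terms, 2)))

def derive(premise, n_steps=2):
    # Apply randomly chosen operations to both sides, keeping every
    # intermediate equation; the last element is the goal equation.
    ops = [
        lambda e: sp.Eq(sp.diff(e.lhs, x), sp.diff(e.rhs, x)),            # differentiate
        lambda e: sp.Eq(sp.integrate(e.lhs, x), sp.integrate(e.rhs, x)),  # integrate
        lambda e: sp.Eq(sp.expand(x * e.lhs), sp.expand(x * e.rhs)),      # multiply by x
    ]
    steps, eq = [premise], premise
    for _ in range(n_steps):
        eq = random.choice(ops)(eq)
        steps.append(eq)
    return steps

def perturb_symbols(steps, new_var=sp.Symbol('v_1')):
    # Rename the free variable to probe sensitivity to unseen symbols.
    return [eq.subs(x, new_var) for eq in steps]

derivation = derive(random_premise())
for i, eq in enumerate(derivation):
    label = "Premise" if i == 0 else ("Goal" if i == len(derivation) - 1 else "Step")
    print(f"{label}: {sp.latex(eq)}")
print("Perturbed goal:", sp.latex(perturb_symbols(derivation)[-1]))

Each generated (premise, goal) pair, together with its intermediate steps, can then serve as a fine-tuning target or an in-context example, and the symbol-renamed variants give out-of-distribution test cases of the kind the abstract describes for probing robustness to unseen symbols.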
Related papers
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Visual Error Patterns in Multi-Modal AI: A Statistical Approach [0.0]
Multi-modal large language models (MLLMs) excel at integrating text and visual data but face systematic challenges when interpreting ambiguous or incomplete visual stimuli.
This study leverages statistical modeling to analyze the factors driving these errors, using a dataset of geometric stimuli characterized by features like 3D, rotation, and missing face/side.
arXiv Detail & Related papers (2024-11-27T01:20:08Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- SLEM: Machine Learning for Path Modeling and Causal Inference with Super Learner Equation Modeling [3.988614978933934]
Causal inference is a crucial goal of science, enabling researchers to arrive at meaningful conclusions using observational data.
Path models, Structural Equation Models (SEMs) and Directed Acyclic Graphs (DAGs) provide a means to unambiguously specify assumptions regarding the causal structure underlying a phenomenon.
We propose Super Learner Equation Modeling, a path modeling technique integrating machine learning Super Learner ensembles.
arXiv Detail & Related papers (2023-08-08T16:04:42Z)
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions.
This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
- A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers [17.075558137261986]
We evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems.
We compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models.
Surprisingly, our evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4.
arXiv Detail & Related papers (2023-05-21T20:40:37Z)
- MAUVE Scores for Generative Models: Theory and Practice [95.86006777961182]
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
We find that MAUVE can quantify the gaps between the distributions of human-written text and those of modern neural language models.
We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics.
arXiv Detail & Related papers (2022-12-30T07:37:40Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- GAM(e) changer or not? An evaluation of interpretable machine learning models based on additive model constraints [5.783415024516947]
This paper investigates a series of intrinsically interpretable machine learning models.
We evaluate the prediction qualities of five GAMs as compared to six traditional ML models.
arXiv Detail & Related papers (2022-04-19T20:37:31Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- PermuteAttack: Counterfactual Explanation of Machine Learning Credit Scorecards [0.0]
This paper is a note on new directions and methodologies for validation and explanation of Machine Learning (ML) models employed for retail credit scoring in finance.
Our proposed framework draws motivation from the field of Artificial Intelligence (AI) security and adversarial ML.
arXiv Detail & Related papers (2020-08-24T00:05:13Z)