Controlling Equational Reasoning in Large Language Models with Prompt Interventions
- URL: http://arxiv.org/abs/2307.09998v5
- Date: Mon, 13 Jan 2025 17:01:23 GMT
- Title: Controlling Equational Reasoning in Large Language Models with Prompt Interventions
- Authors: Jordan Meadows, Marco Valentino, Andre Freitas,
- Abstract summary: This paper investigates how hallucination rates in Large Language Models (LLMs) may be controlled via a symbolic data generation framework.
We generate data for a derivation generation task using a symbolic engine, applying targeted interventions to prompts to perturb features of mathematical derivations.
We then evaluate the effect of prompt interventions across a range of LLMs including fine-tuned T5 models, GPT, and LLaMa-based models.
- Score: 3.9735602856280132
- License:
- Abstract: This paper investigates how hallucination rates in Large Language Models (LLMs) may be controlled via a symbolic data generation framework, exploring a fundamental relationship between the rate of certain mathematical errors and types of input intervention. Specifically, we systematically generate data for a derivation generation task using a symbolic engine, applying targeted interventions to prompts to perturb features of mathematical derivations such as the surface forms of symbols, equational tree structures, and mathematical context. We then evaluate the effect of prompt interventions across a range of LLMs including fine-tuned T5 models, GPT, and LLaMa-based models. Our experiments suggest that T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework. However, an extensive evaluation based on human analysis, template-based error detection, and text generation metrics reveals model weaknesses beyond what the reference-based metrics singularly describe. We use these results to tie characteristic distributional footprints of interventions to the human evaluation of LLM derivation quality, potentially leading to significant control over fine-grained mathematical capabilities of language models with respect to specific types of errors.
Related papers
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z) - Visual Error Patterns in Multi-Modal AI: A Statistical Approach [0.0]
Multi-modal large language models (MLLMs) excel at integrating text and visual data but face systematic challenges when interpreting ambiguous or incomplete visual stimuli.
This study leverages statistical modeling to analyze the factors driving these errors, using a dataset of geometric stimuli characterized by features like 3D, rotation, and missing face/side.
arXiv Detail & Related papers (2024-11-27T01:20:08Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - SLEM: Machine Learning for Path Modeling and Causal Inference with Super
Learner Equation Modeling [3.988614978933934]
Causal inference is a crucial goal of science, enabling researchers to arrive at meaningful conclusions using observational data.
Path models, Structural Equation Models (SEMs) and Directed Acyclic Graphs (DAGs) provide a means to unambiguously specify assumptions regarding the causal structure underlying a phenomenon.
We propose Super Learner Equation Modeling, a path modeling technique integrating machine learning Super Learner ensembles.
arXiv Detail & Related papers (2023-08-08T16:04:42Z) - A Mechanistic Interpretation of Arithmetic Reasoning in Language Models
using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions.
This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z) - A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers [17.075558137261986]
We evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems.
We compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models.
Surprisingly, our evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4.
arXiv Detail & Related papers (2023-05-21T20:40:37Z) - MAUVE Scores for Generative Models: Theory and Practice [95.86006777961182]
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
We find that MAUVE can quantify the gaps between the distributions of human-written text and those of modern neural language models.
We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics.
arXiv Detail & Related papers (2022-12-30T07:37:40Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - GAM(e) changer or not? An evaluation of interpretable machine learning
models based on additive model constraints [5.783415024516947]
This paper investigates a series of intrinsically interpretable machine learning models.
We evaluate the prediction qualities of five GAMs as compared to six traditional ML models.
arXiv Detail & Related papers (2022-04-19T20:37:31Z) - A comprehensive comparative evaluation and analysis of Distributional
Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z) - PermuteAttack: Counterfactual Explanation of Machine Learning Credit
Scorecards [0.0]
This paper is a note on new directions and methodologies for validation and explanation of Machine Learning (ML) models employed for retail credit scoring in finance.
Our proposed framework draws motivation from the field of Artificial Intelligence (AI) security and adversarial ML.
arXiv Detail & Related papers (2020-08-24T00:05:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.