Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
- URL: http://arxiv.org/abs/2406.15444v3
- Date: Thu, 24 Oct 2024 08:02:14 GMT
- Title: Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
- Authors: Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
- Abstract summary: Large Language Models excel at solving math word problems, but struggle with real-world problems containing irrelevant information.
We propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables.
Fine-tuning on adversarial training instances improves performance on adversarial MWPs by 8%.
- Score: 52.99006895757801
- License:
- Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
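As a rough illustration of the approach described in the abstract, the sketch below composes a prompt that asks an LLM to inject irrelevant numerical variables into an MWP while preserving the answer. The prompt wording, the sample problem, and the `call_llm` placeholder are assumptions for illustration, not the paper's exact framework.
```python
# Minimal, hypothetical sketch of the noise-injection idea: wrap an original MWP
# in a prompt asking an LLM to add irrelevant numerical variables without
# changing the answer. Prompt wording, sample problem, and `call_llm` are
# illustrative placeholders, not the paper's exact framework.

def build_adversarial_prompt(problem: str) -> str:
    """Compose an instruction asking a model to inject irrelevant numerical noise."""
    return (
        "Rewrite the following math word problem by adding one or two sentences "
        "that introduce extra named quantities (numbers) irrelevant to the "
        "question. Do not change the question or its correct answer.\n\n"
        f"Problem: {problem}\nAdversarial variant:"
    )


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion client; swap in a real API call here."""
    raise NotImplementedError("plug in an LLM client of your choice")


if __name__ == "__main__":
    original = (
        "A baker sells 12 cupcakes in the morning and 9 in the afternoon. "
        "How many cupcakes does the baker sell in total?"
    )
    print(build_adversarial_prompt(original))
```
The adversarial variant returned by the model can then be paired with the original answer to build training or evaluation instances such as those in PROBLEMATHIC and GSM-8K-Adv.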
Related papers
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
- Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning [11.63133816413199]
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs).
We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models.
We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.
arXiv Detail & Related papers (2024-06-16T08:06:05Z)
- DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation [39.857198257988685]
Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications.
They are prone to hallucinations, generating claims that contradict established facts, and producing inconsistent responses when the same prompt is presented multiple times.
This paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains.
arXiv Detail & Related papers (2024-06-13T14:18:13Z)
- Can LLMs Solve longer Math Word Problems Better? [47.227621867242]
Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs).
This study pioneers the exploration of Context Length Generalizability (CoLeG).
Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems.
arXiv Detail & Related papers (2024-05-23T17:13:50Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- What Makes Math Word Problems Challenging for LLMs? [5.153388971862429]
We conduct an in-depth analysis of the key linguistic and mathematical characteristics of math word problems (MWPs).
We train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent large language models (LLMs).
arXiv Detail & Related papers (2024-03-17T23:18:40Z)
- NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation [92.5132418788568]
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset.
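A minimal sketch of how these two rates could be computed, assuming they reduce to simple counts over the two evaluation subsets; the function and variable names are illustrative, not NoMIRACL's official scoring code.
```python
# Minimal sketch of the two robustness metrics described above, assuming they
# reduce to simple fractions over the two evaluation subsets; counts and
# variable names are illustrative, not NoMIRACL's official scoring code.

def hallucination_rate(answered_on_non_relevant: int, non_relevant_total: int) -> float:
    """Share of non-relevant-subset queries where the model still produced an answer."""
    return answered_on_non_relevant / non_relevant_total


def error_rate(missed_on_relevant: int, relevant_total: int) -> float:
    """Share of relevant-subset queries where the model failed to recognize a relevant passage."""
    return missed_on_relevant / relevant_total


print(hallucination_rate(42, 300))  # -> 0.14
print(error_rate(75, 500))          # -> 0.15
```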
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Adversarial Examples for Evaluating Math Word Problem Solvers [4.266990593059533]
Math Word Problem (MWP) solvers have achieved high performance on benchmark datasets.
The extent to which existing MWP solvers truly understand language and its relation with numbers is still unclear.
We generate adversarial attacks to evaluate the robustness of state-of-the-art MWP solvers.
arXiv Detail & Related papers (2021-09-13T12:47:40Z)
- Are NLP Models really able to Solve Simple Math Word Problems? [7.433931244705934]
We show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs.
We introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets.
The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.
arXiv Detail & Related papers (2021-03-12T10:23:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.