Related papers: Investigating the Robustness of LLMs on Math Word Problems

URL: http://arxiv.org/abs/2406.15444v1
Date: Thu, 30 May 2024 18:07:13 GMT
Title: Investigating the Robustness of LLMs on Math Word Problems
Authors: Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
Abstract summary: Large Language Models excel at solving math word problems, but struggle with real-world problems containing irrelevant information. We propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%.
Score: 52.99006895757801
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, ProbleMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and better ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to ~6%.
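The "average relative performance drop" quoted in the abstract can be illustrated with a minimal sketch. It assumes the drop is measured relative to accuracy on the non-adversarial problems and then averaged over models; the abstract does not spell out the exact definition. The function name, model names, and accuracy values below are hypothetical and are not results from the paper.

```python
def relative_drop(clean_acc: float, adv_acc: float) -> float:
    """Return the accuracy drop as a fraction of the clean accuracy.

    Assumed definition: (clean - adversarial) / clean.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return (clean_acc - adv_acc) / clean_acc


# Hypothetical (clean, adversarial) accuracies; NOT numbers from the paper.
scores = {
    "model_a": (0.80, 0.60),
    "model_b": (0.50, 0.36),
}

# Per-model relative drop, then the unweighted mean across models.
drops = {name: relative_drop(clean, adv) for name, (clean, adv) in scores.items()}
average_drop = sum(drops.values()) / len(drops)

for name, drop in drops.items():
    print(f"{name}: relative drop {drop:.1%}")
print(f"average relative drop {average_drop:.1%}")
```

A relative drop differs from an absolute one: in this sketch model_a loses 20 accuracy points, which is a 25% relative drop.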

Related papers

SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. The concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z)
Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z)
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning [11.63133816413199]
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs). We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.
arXiv Detail & Related papers (2024-06-16T08:06:05Z)
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation [39.857198257988685]
Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. They are prone to hallucinations, generating claims that contradict established facts, and producing inconsistent responses when the same prompt is presented multiple times. This paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains.
arXiv Detail & Related papers (2024-06-13T14:18:13Z)
Can LLMs Solve longer Math Word Problems Better? [47.227621867242]
Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs). This study pioneers the exploration of Context Length Generalizability (CoLeG). Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems.
arXiv Detail & Related papers (2024-05-23T17:13:50Z)
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
What Makes Math Word Problems Challenging for LLMs? [5.153388971862429]
We conduct an in-depth analysis of the key linguistic and mathematical characteristics of math word problems (MWPs). We train feature-based classifiers to better understand the impact of each feature on the overall difficulty of MWPs for prominent large language models (LLMs).
arXiv Detail & Related papers (2024-03-17T23:18:40Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
Adversarial Examples for Evaluating Math Word Problem Solvers [4.266990593059533]
Math Word Problem (MWP) solvers have achieved high performance on benchmark datasets. The extent to which existing MWP solvers truly understand language and its relation with numbers is still unclear. We generate adversarial attacks to evaluate the robustness of state-of-the-art MWP solvers.
arXiv Detail & Related papers (2021-09-13T12:47:40Z)
Are NLP Models really able to Solve Simple Math Word Problems? [7.433931244705934]
We show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. We introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.
arXiv Detail & Related papers (2021-03-12T10:23:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.