MathAttack: Attacking Large Language Models Towards Math Solving Ability
- URL: http://arxiv.org/abs/2309.01686v1
- Date: Mon, 4 Sep 2023 16:02:23 GMT
- Title: MathAttack: Attacking Large Language Models Towards Math Solving Ability
- Authors: Zihao Zhou and Qiufeng Wang and Mingyu Jin and Jie Yao and Jianan Ye
and Wei Liu and Wei Wang and Xiaowei Huang and Kaizhu Huang
- Abstract summary: We propose a MathAttack model to attack MWP samples directly, which is closer to the essence of security in math solving.
It is essential to preserve the mathematical logic of the original MWPs during the attack.
Extensive experiments on our RobustMath dataset and two other math benchmarks, GSM8K and MultiArith, show that MathAttack can effectively attack the math solving ability of LLMs.
- Score: 29.887497854000276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the boom of Large Language Models (LLMs), research on solving Math
Word Problems (MWPs) has recently made great progress. However, few studies
examine the security of LLMs in math solving ability. Instead of attacking
prompts in the use of LLMs, we propose a MathAttack model to attack MWP samples
directly, which is closer to the essence of security in math solving. Compared
to traditional text adversarial attacks, it is essential to preserve the
mathematical logic of the original MWPs during the attack. To this end, we
propose logical entity recognition to identify logical entities, which are then
frozen. Subsequently, the remaining text is attacked by a word-level attacker.
Furthermore, we propose a new dataset, RobustMath, to evaluate the robustness
of LLMs in math solving ability. Extensive experiments on our RobustMath and
two other math benchmark datasets, GSM8K and MultiArith, show that MathAttack
can effectively attack the math solving ability of LLMs. In the experiments, we
observe that (1) adversarial samples generated against higher-accuracy LLMs are
also effective for attacking LLMs with lower accuracy (e.g., they transfer from
larger to smaller-size LLMs, or from few-shot to zero-shot prompts); (2)
complex MWPs (with more solving steps, longer text, or more numbers) are more
vulnerable to attack; (3) the robustness of LLMs can be improved by using our
adversarial samples in few-shot prompts. Finally, we hope our practice and
observations can serve as an important step towards enhancing the robustness of
LLMs in math solving ability. We will release our code and dataset.
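The pipeline described in the abstract (freeze the logical entities, then run a
word-level attacker over the remaining tokens, keeping a perturbation only if it
flips the victim model's answer) can be summarized in a short sketch. This is a
minimal illustration, not the authors' released code: the regex-based entity
recognizer, the SYNONYMS table, and the query_llm / toy_llm callables are
hypothetical stand-ins for the paper's learned components and for a real LLM.

```python
import re
from typing import Callable, Optional

def find_logical_entities(problem: str) -> set:
    """Identify tokens that carry the mathematical logic: numerals and a few quantity words."""
    frozen = set(re.findall(r"\d+(?:\.\d+)?", problem))
    frozen |= {w.strip(".,?") for w in problem.split()
               if w.lower().strip(".,?") in {"each", "total", "per", "twice", "more"}}
    return frozen

# Tiny stand-in synonym table; the paper's word-level attacker ranks candidate
# substitutions rather than reading them from a fixed list like this one.
SYNONYMS = {"bought": ["purchased", "acquired"],
            "gave": ["handed", "passed"],
            "friend": ["pal", "companion"]}

def word_level_attack(problem: str, gold_answer: str,
                      query_llm: Callable[[str], str]) -> Optional[str]:
    """Try single-word substitutions on non-frozen tokens until the victim's answer changes."""
    frozen = find_logical_entities(problem)
    tokens = problem.split()
    for i, tok in enumerate(tokens):
        key = tok.lower().strip(".,?")
        if tok.strip(".,?") in frozen or key not in SYNONYMS:
            continue  # never touch frozen logical entities or unknown words
        for candidate in SYNONYMS[key]:
            trial = tokens.copy()
            trial[i] = candidate
            adversarial = " ".join(trial)
            if query_llm(adversarial).strip() != gold_answer:
                return adversarial  # the perturbed MWP flips the victim's answer
    return None  # no successful single-word attack found

if __name__ == "__main__":
    # Toy "victim" that answers by summing the numbers in the problem (stand-in for a real LLM).
    toy_llm = lambda p: str(int(sum(float(n) for n in re.findall(r"\d+(?:\.\d+)?", p))))
    mwp = "Tom bought 3 apples and his friend gave him 2 more. How many apples does Tom have?"
    print(word_level_attack(mwp, gold_answer="5", query_llm=toy_llm))
```

Against the toy victim above no swap changes the answer, so the function returns None;
against a real LLM the first answer-flipping perturbation would be returned as the
adversarial sample, with its numbers, and hence its mathematical logic, left intact.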
Related papers
- LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial, e.g., counting the number of occurrences of the character 'r' in the word strawberry.
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z)
- Give me a hint: Can LLMs take a hint to solve math problems? [0.5742190785269342]
We propose giving "hints" to improve the language model's performance on advanced mathematical problems.
We also test robustness to adversarial hints and demonstrate the models' sensitivity to them.
arXiv Detail & Related papers (2024-10-08T11:09:31Z)
- MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
- Cutting Through the Noise: Boosting LLM Performance on Math Word Problems [52.99006895757801]
Large Language Models excel at solving math word problems, but struggle with real-world problems containing irrelevant information.
We propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables.
Fine-tuning on adversarial training instances improves performance on adversarial MWPs by 8%.
arXiv Detail & Related papers (2024-05-30T18:07:13Z)
- Genshin: General Shield for Natural Language Processing with Large Language Models [6.228210545695852]
Large language models (LLMs) have been trending recently, demonstrating considerable advancement and generalizability power in countless domains.
LLMs create an even bigger black box, exacerbating opacity, with interpretability limited to only a few approaches.
We propose a novel cascading framework called Genshin that combines the generalizability of the LLM, the discrimination of the median model, and the interpretability of the simple model.
arXiv Detail & Related papers (2024-05-29T04:04:05Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Adversarial Math Word Problem Generation [6.92510069380188]
We propose a new paradigm for ensuring fair evaluation of large language models (LLMs).
We generate adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs.
We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability.
arXiv Detail & Related papers (2024-02-27T22:07:52Z)
- InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs, InternLM-Math, which are continually pre-trained from InternLM2.
We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and a code interpreter in a unified seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm (a brief illustrative sketch of the declarative approach follows this list).
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
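The SatLM entry above describes a declarative, solver-aided recipe: the LLM emits a
constraint specification rather than an imperative program, and an off-the-shelf
automated reasoner derives the final answer. Below is a minimal sketch of that idea,
not the paper's pipeline: llm_generate_spec is a hypothetical stand-in for the LLM
call (its "generated" specification is hard-coded for one toy problem), and Z3 plays
the role of the off-the-shelf solver.

```python
from z3 import Int, Solver, sat

def llm_generate_spec(problem: str):
    """Hypothetical LLM call that turns a word problem into declarative constraints.
    Hard-coded here for: 'Alice has twice as many marbles as Bob. Together they have 18.'"""
    alice, bob = Int("alice"), Int("bob")
    constraints = [alice == 2 * bob, alice + bob == 18]
    return constraints, bob  # constraints plus the variable the question asks about

def solve_with_sat(problem: str):
    constraints, target = llm_generate_spec(problem)
    solver = Solver()
    solver.add(*constraints)
    if solver.check() == sat:
        return solver.model()[target]  # the solver, not the LLM, computes the answer
    return None  # the specification is unsatisfiable

if __name__ == "__main__":
    question = ("Alice has twice as many marbles as Bob. Together they have 18 marbles. "
                "How many marbles does Bob have?")
    print(solve_with_sat(question))  # -> 6
```

Because the arithmetic is delegated to the solver, the language model only needs to
produce a faithful specification of the problem rather than carry out the computation
itself.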