Adversarial Math Word Problem Generation
- URL: http://arxiv.org/abs/2402.17916v3
- Date: Sat, 15 Jun 2024 22:36:20 GMT
- Title: Adversarial Math Word Problem Generation
- Authors: Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra
- Abstract summary: We propose a new paradigm for ensuring fair evaluation of large language models (LLMs).
We generate adversarial examples that preserve the structure and difficulty of the original questions intended for assessment but are unsolvable by LLMs.
We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability.
- Score: 6.92510069380188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.
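The abstract describes the mechanism only at a high level: parse the arithmetic underlying a question into an abstract syntax tree, edit only the numeric leaves so that structure and difficulty are preserved, and recompute the answer. A minimal Python sketch of that editing step is given below; the example expression, perturbation range, and function names are illustrative assumptions, not the authors' implementation.

```python
import ast
import random

def perturb_numbers(expression: str, rng: random.Random) -> tuple[str, float]:
    """Parse an arithmetic expression into an AST, replace each numeric
    constant with a new value, and return the edited expression together
    with its recomputed answer (assumed workflow, not the paper's code)."""
    tree = ast.parse(expression, mode="eval")

    class NumberEditor(ast.NodeTransformer):
        def visit_Constant(self, node: ast.Constant) -> ast.AST:
            if isinstance(node.value, (int, float)):
                # Hypothetical perturbation: swap in a different small integer.
                return ast.copy_location(ast.Constant(value=rng.randint(2, 99)), node)
            return node

    new_tree = ast.fix_missing_locations(NumberEditor().visit(tree))
    new_expression = ast.unparse(new_tree)
    new_answer = eval(compile(new_tree, "<expr>", "eval"))  # trusted expressions only
    return new_expression, new_answer

# Example: "Ann buys 3 boxes with 4 pens each" corresponds to the expression "3 * 4".
rng = random.Random(0)
print(perturb_numbers("3 * 4", rng))
```

In the paper, the replacement values would additionally be chosen so that the edited problem stays well-posed while driving the target LLM to a wrong answer; the random substitution above only illustrates the structural editing step.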
Related papers
- LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial, e.g., counting the number of occurrences of the character "r" in the word "strawberry".
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z) - Are You Human? An Adversarial Benchmark to Expose LLMs [2.6528263069045126]
Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation.
We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time.
arXiv Detail & Related papers (2024-10-12T15:33:50Z) - Can LLMs Solve longer Math Word Problems Better? [47.227621867242]
Math Word Problems (MWPs) are crucial for evaluating the capability of Large Language Models (LLMs).
This study pioneers the exploration of Context Length Generalizability (CoLeG).
Two novel metrics are proposed to assess the efficacy and resilience of LLMs in solving these problems.
arXiv Detail & Related papers (2024-05-23T17:13:50Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's comprehensive grasp of language and its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for evaluating the scientific problem-solving abilities of Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework in which multiple agents present their arguments in a "tit for tat" fashion and a judge manages the debate process to obtain a final solution (a rough sketch of such a loop follows this entry).
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
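Because the MAD entry above describes a concrete protocol (agents arguing over several rounds, a judge producing the final solution), a minimal sketch of such a debate loop is included here; the `llm` callable, prompts, and agent/round counts are placeholders and not the framework's actual API.

```python
from typing import Callable

def multi_agent_debate(question: str, llm: Callable[[str], str],
                       n_agents: int = 2, n_rounds: int = 3) -> str:
    """Rough debate loop: each agent answers, then rebuts the others for a few
    rounds; a judge prompt reads the transcript and returns a final answer."""
    # Initial positions from each agent.
    arguments = [llm(f"Answer the question: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        # Each agent sees the opponents' latest arguments and responds in turn.
        arguments = [
            llm(
                f"Question: {question}\n"
                f"Opponents argued: {[a for j, a in enumerate(arguments) if j != i]}\n"
                f"Your previous argument: {arguments[i]}\n"
                "Rebut the opponents or refine your own answer."
            )
            for i in range(n_agents)
        ]
    transcript = "\n---\n".join(arguments)
    # The judge reads the full debate and settles on a final solution.
    return llm(f"You are the judge. Read the debate and give the final answer to: "
               f"{question}\n{transcript}")
```

Any chat-completion client can be wrapped as `llm`; the design point is simply that the agents' divergent arguments are kept separate across rounds and only merged by the judge at the end.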