Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions
- URL: http://arxiv.org/abs/2406.05055v1
- Date: Fri, 7 Jun 2024 16:24:12 GMT
- Title: Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions
- Authors: Shi-Yu Tian, Zhi Zhou, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li,
- Abstract summary: We develop a benchmark called Problems with Missing and Contradictory conditions (PMC)
We introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios.
We propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which utilizes the SMT-LIB language to model the problems instead of solving them directly.
- Score: 48.251724997889184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, which can be further improved through few-shot prompting techniques. However, the current evaluation primarily focuses on carefully constructed benchmarks and neglects the consideration of real-world reasoning problems that present missing and contradictory conditions, known as ill-defined problems. Our observations suggest that existing few-shot prompting techniques are ineffective in such scenarios, often providing overconfident answers or hallucination. To further study this problem, we develop a benchmark called Problems with Missing and Contradictory conditions (PMC) and introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios. Our analysis using the PMC benchmark reveals a trade-off dilemma between the performance of mathematical reasoning for well-defined problems and the ability to recognize ill-defined problems. To address the challenges posed by PMC, we propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which utilizes the SMT-LIB language to model the problems instead of solving them directly. Subsequently, a double-check solving strategy checks the satisfiability and uniqueness of the solution and provides final feedback. Extensive experiments demonstrate the superiority of our SLP approach compared to existing few-shot prompting methods when dealing with problems with missing and contradictory conditions. We will open-source our benchmark and code to facilitate future research.
Related papers
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [85.51252685938564]
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML)
As with other ML models, large language models (LLMs) are prone to make incorrect predictions, hallucinate'' by fabricating claims, or simply generate low-quality output for a given input.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Sound Heuristic Search Value Iteration for Undiscounted POMDPs with Reachability Objectives [16.101435842520473]
This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem.
Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP.
We present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space.
arXiv Detail & Related papers (2024-06-05T02:33:50Z) - Chain of Thoughtlessness? An Analysis of CoT in Planning [17.329365493094542]
Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution.
This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain.
We find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class.
arXiv Detail & Related papers (2024-05-08T02:48:28Z) - Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models [0.0]
We formalize a planning-based approach to perform multi-step problem solving with language models.
We demonstrate a superior success rate of 89.4% on the Game of 24 task as compared to existing approaches.
arXiv Detail & Related papers (2024-04-29T18:51:17Z) - PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's peiceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, theThoughtived performance of GPT-4 has experienced a cliff like decline in problems after September 2021 consistently across all the difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models [62.96551299003463]
We propose textbftextitThought Propagation (TP) to enhance the complex reasoning ability of Large Language Models.
TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one.
TP reuses the results of analogous problems to directly yield a new solution or derive a knowledge-intensive plan for execution to amend the initial solution obtained from scratch.
arXiv Detail & Related papers (2023-10-06T01:40:09Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.89346248535922]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z) - Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement [50.62461749446111]
Self-Polish (SP) is a novel method that facilitates the model's reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable.
SP is to all other prompting methods of answer/reasoning side like CoT, allowing for seamless integration with state-of-the-art techniques for further improvement.
arXiv Detail & Related papers (2023-05-23T19:58:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.