LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought
- URL: http://arxiv.org/abs/2405.06705v1
- Date: Thu, 9 May 2024 07:37:34 GMT
- Title: LLMs can Find Mathematical Reasoning Mistakes by Pedagogical Chain-of-Thought
- Authors: Zhuoxuan Jiang, Haoyuan Peng, Shanshan Feng, Fan Li, Dongsheng Li,
- Abstract summary: The Pedagogical Chain-of-Thought (PedCoT) is designed to guide the identification of reasoning mistakes.
PedCoT consists of pedagogical principles for prompts (PPP) design, two-stage interaction process (TIP) and grounded PedCoT prompts.
The proposed method can achieve the goal of reliable mathematical mistake identification and provide a foundation for automatic math answer grading.
- Score: 28.122761006724925
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-correction is emerging as a promising approach to mitigate the issue of hallucination in Large Language Models (LLMs). To facilitate effective self-correction, recent research has proposed mistake detection as its initial step. However, current literature suggests that LLMs often struggle with reliably identifying reasoning mistakes when using simplistic prompting strategies. To address this challenge, we introduce a unique prompting strategy, termed the Pedagogical Chain-of-Thought (PedCoT), which is specifically designed to guide the identification of reasoning mistakes, particularly mathematical reasoning mistakes. PedCoT consists of pedagogical principles for prompts (PPP) design, two-stage interaction process (TIP) and grounded PedCoT prompts, all inspired by the educational theory of the Bloom Cognitive Model (BCM). We evaluate our approach on two public datasets featuring math problems of varying difficulty levels. The experiments demonstrate that our zero-shot prompting strategy significantly outperforms strong baselines. The proposed method can achieve the goal of reliable mathematical mistake identification and provide a foundation for automatic math answer grading. The results underscore the significance of educational theory, serving as domain knowledge, in guiding prompting strategy design for addressing challenging tasks with LLMs effectively.
Related papers
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.
We rigorously analyze both final answers and solution steps to identify reasoning failures.
We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [86.79757571440082]
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks.
We identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts.
We propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts.
arXiv Detail & Related papers (2025-01-30T18:58:18Z) - System-2 Mathematical Reasoning via Enriched Instruction Tuning [13.672967091915181]
Enriched Instruction Tuning (EIT) is a method that enriches existing human-annotated mathematical datasets by synergizing human and AI feedback.
EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods.
arXiv Detail & Related papers (2024-12-22T10:49:27Z) - LLM-based Cognitive Models of Students with Misconceptions [55.29525439159345]
This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement.
We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns.
Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
arXiv Detail & Related papers (2024-10-16T06:51:09Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Recursive Introspection: Teaching Language Model Agents How to Self-Improve [30.086494067593268]
We develop RISE: Recursive IntroSpEction, an approach for fine-tuning large language models.
Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks.
arXiv Detail & Related papers (2024-07-25T17:35:59Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z) - Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs [2.3020018305241337]
Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models.
We propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions.
Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder.
arXiv Detail & Related papers (2024-04-11T22:19:50Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Adversarial Math Word Problem Generation [6.92510069380188]
We propose a new paradigm for ensuring fair evaluation of large language models (LLMs)
We generate adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs.
We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability.
arXiv Detail & Related papers (2024-02-27T22:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.