Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2410.08068v1
- Date: Thu, 10 Oct 2024 16:02:36 GMT
- Title: Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
- Authors: Wenting Tan, Dongxiao Chen, Jieting Xue, Zihao Wang, Taijie Chen
- Abstract summary: Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks.
Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities.
We propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students.
- Score: 8.370453544530914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook the prior knowledge of specific concepts, theorems, and tricks that most arithmetic reasoning problems require. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, thereby enhancing their reasoning abilities. Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. Experiments on nine benchmarks demonstrate that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%), respectively. Our data and code are available at https://github.com/SallyTan13/Teaching-Inspired-Prompting.
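The abstract describes the framework only at a high level. As a rough illustration of the core idea (prepending retrieved concepts, relevant theorems, and an analogous solved problem to the question before asking for a solution), here is a minimal sketch; all names and the canned retrieval content are hypothetical, not taken from the authors' released code.

```python
# Hypothetical sketch of teaching-inspired prompting: prepend background
# knowledge and an analogous solved problem before asking for a solution.
# Retrieval is stubbed with canned content for illustration only.

def retrieve_background(question: str) -> dict:
    """Stand-in for the framework's retrieval step (e.g., a search over a
    curated knowledge base of concepts, theorems, and solved problems)."""
    return {
        "concepts": ["The perimeter of a rectangle is 2 * (length + width)."],
        "theorems": [],
        "similar_problems": [
            ("A rectangle is 3 m by 5 m. What is its perimeter?",
             "Perimeter = 2 * (3 + 5) = 16 m."),
        ],
    }

def build_teaching_prompt(question: str) -> str:
    """Assemble a teacher-style prompt: concepts, theorems, a worked
    analogous problem, then the target question."""
    bg = retrieve_background(question)
    parts = ["You are a patient math teacher. Use the background below."]
    if bg["concepts"]:
        parts.append("Relevant concepts:\n" + "\n".join(bg["concepts"]))
    if bg["theorems"]:
        parts.append("Relevant theorems:\n" + "\n".join(bg["theorems"]))
    for q, sol in bg["similar_problems"]:
        parts.append(f"Similar solved problem:\nQ: {q}\nA: {sol}")
    parts.append(f"Now solve step by step:\nQ: {question}\nA:")
    return "\n\n".join(parts)

print(build_teaching_prompt("A rectangle is 4 m by 7 m. What is its perimeter?"))
```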
Related papers
- BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains.
They still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics.
We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
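The BackVerify idea, as the abstract presents it, is to check a candidate answer against the original problem rather than re-deriving it. A minimal sketch of that check, where `ask_llm` is a hypothetical stand-in for any chat-completion call and the tree-search component is omitted:

```python
def back_verify(problem: str, candidate: str, ask_llm) -> bool:
    """Back-verification in the spirit of BEATS: ask the model to confirm
    whether the candidate answer is consistent with the problem statement."""
    verdict = ask_llm(
        f"Problem: {problem}\nProposed answer: {candidate}\n"
        "Substitute the answer back into the problem and check it. "
        "Reply with exactly 'VALID' or 'INVALID'."
    )
    return verdict.strip().upper().startswith("VALID")
```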
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
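As a rough illustration of the staged-prompting pattern DUP describes (restate the core question and the needed facts before solving), here is a minimal sketch; `ask_llm` is a hypothetical stand-in for any chat-completion call, not the authors' implementation:

```python
from typing import Callable

def dup_style_solve(problem: str, ask_llm: Callable[[str], str]) -> str:
    """Staged prompting in the spirit of DUP: surface the core question and
    the relevant givens first, then condition the final reasoning on both."""
    core = ask_llm(f"{problem}\n\nIn one sentence, what is the core question being asked?")
    facts = ask_llm(f"{problem}\n\nList only the information needed to answer: {core}")
    return ask_llm(
        f"Problem: {problem}\nCore question: {core}\nRelevant facts: {facts}\n"
        "Solve step by step and state the final answer."
    )
```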
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
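The robustness test amounts to perturbing a question and checking whether accuracy holds. A toy illustration of one such perturbation, numerical substitution, with the gold answer recomputed by a known solver (this is not the GSM-Plus generation pipeline itself):

```python
import random
import re

def numeric_variant(question: str, answer_fn, seed: int = 0):
    """Swap each number in a word problem for a nearby value and recompute
    the gold answer with a known solver, in the spirit of GSM-Plus's
    numerical-substitution variations (toy version, not the real pipeline)."""
    rng = random.Random(seed)
    nums = [int(n) for n in re.findall(r"\d+", question)]
    new_nums = [max(1, n + rng.randint(-2, 2)) for n in nums]
    it = iter(new_nums)
    new_q = re.sub(r"\d+", lambda m: str(next(it)), question)
    return new_q, answer_fn(*new_nums)

q = "Sam has 3 boxes with 4 apples each. How many apples does Sam have?"
print(numeric_variant(q, lambda boxes, per: boxes * per))
```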
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
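The mechanism the abstract highlights is generate-then-validate: keep only synthesized (problem, proof) pairs that pass a checker (the paper validates formally; here both the generator and the checker are abstract callables, purely for illustration):

```python
from typing import Callable, List, Tuple

def synthesize_validated(
    generate: Callable[[], Tuple[str, str]],
    validate: Callable[[Tuple[str, str]], bool],
    n_target: int,
    max_tries: int = 10_000,
) -> List[Tuple[str, str]]:
    """Generate-then-filter loop in the spirit of MUSTARD: sample candidate
    (problem, proof) pairs and keep only those the checker accepts."""
    kept: List[Tuple[str, str]] = []
    for _ in range(max_tries):
        if len(kept) >= n_target:
            break
        candidate = generate()
        if validate(candidate):
            kept.append(candidate)
    return kept
```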
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models [122.19845578690466]
Step-Back Prompting enables LLMs to perform abstraction, deriving high-level concepts and first principles from instances that contain specific details.
Using the concepts and principles to guide reasoning, LLMs significantly improve their abilities in following a correct reasoning path towards the solution.
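A minimal sketch of that two-pass scheme; `ask_llm` is again a hypothetical stand-in for a chat-completion call, and the prompt wording is illustrative rather than the paper's:

```python
def step_back_solve(question: str, ask_llm) -> str:
    """Two-pass prompting in the spirit of Step-Back: first derive the
    governing concept or principle, then reason forward from it."""
    principle = ask_llm(
        f"{question}\n\nStep back: what general concept or principle governs this problem?"
    )
    return ask_llm(
        f"Principle: {principle}\nQuestion: {question}\n"
        "Using the principle, reason step by step to the answer."
    )
```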
arXiv Detail & Related papers (2023-10-09T19:48:55Z)
- MathPrompter: Mathematical Reasoning using Large Language Models [7.953723258038284]
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks.
MathPrompter uses zero-shot chain-of-thought prompting to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways, then cross-checks their results.
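The cross-checking step reduces to an agreement test over the candidate answers; a simplified version (in the real system the candidate values come from LLM-generated expressions and functions, not hand-written ones as here):

```python
from collections import Counter

def consensus_answer(candidates, ndigits: int = 6):
    """Majority vote over numeric answers produced by independently derived
    solution paths, in the spirit of MathPrompter's cross-checking step."""
    rounded = [round(c, ndigits) for c in candidates]
    value, votes = Counter(rounded).most_common(1)[0]
    return value if votes > 1 else None  # require at least two paths to agree

# e.g. three solution paths for "20% of 150": an algebraic expression,
# a Python function, and one faulty derivation.
print(consensus_answer([150 * 0.20, 0.2 * 150, 150 / 20]))  # -> 30.0
```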
arXiv Detail & Related papers (2023-03-04T04:43:49Z)
- Teaching Algorithmic Reasoning via In-context Learning [45.45116247046013]
We show that it is possible to teach algorithmic reasoning to large language models (LLMs) via in-context learning.
We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks.
We achieve error reductions of approximately 10x, 9x, 5x, and 2x on these tasks, respectively, compared to the best available baselines.
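The approach works by spelling out every intermediate step of the algorithm in the in-context exemplars rather than giving bare input-output pairs. A toy generator for one such exemplar (digit-by-digit addition with explicit carries; the format is illustrative, not the paper's exact prompt):

```python
def addition_exemplar(a: int, b: int) -> str:
    """Render a worked, digit-by-digit addition with explicit carries: the
    kind of fully spelled-out exemplar used to teach an algorithm in context
    (simplified, not the paper's exact prompt format)."""
    da, db = str(a)[::-1], str(b)[::-1]
    steps, carry, result = [], 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"digit {i}: {x} + {y} + carry {carry} = {s}, write {s % 10}, carry {s // 10}")
        result.append(str(s % 10))
        carry = s // 10
    if carry:
        result.append(str(carry))
        steps.append(f"final carry: write {carry}")
    return f"Q: {a} + {b}\n" + "\n".join(steps) + f"\nA: {''.join(reversed(result))}"

print(addition_exemplar(57, 68))
```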
arXiv Detail & Related papers (2022-11-15T06:12:28Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
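Because LILA stores gold solutions as executable Python programs, a scorer can run the program and compare its output with a model's prediction. A simplified sketch under that assumption (real evaluation would sandbox execution and normalize answers):

```python
import contextlib
import io

def score_example(gold_program: str, predicted: str) -> bool:
    """Run a gold-solution Python program and compare its printed output
    with the model's predicted answer. Simplified: no sandboxing or answer
    normalization, and the program is assumed to print its result."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(gold_program, {})  # only run trusted benchmark code
    return buf.getvalue().strip() == predicted.strip()

print(score_example("print(2 + 3)", "5"))  # -> True
```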
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model, code-davinci, and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
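The final aggregation in this style of method is a verifier-weighted vote over the answers of the sampled reasoning paths. A sketch of that step, with per-path verifier scores assumed to be precomputed (the step-aware verifier itself is a trained model and is not reproduced here):

```python
from collections import defaultdict

def verifier_weighted_vote(paths):
    """Aggregate sampled reasoning paths by summing verifier scores per final
    answer, in the spirit of DIVERSE's voting verifier. Each path is a
    (final_answer, verifier_score) pair; scores are assumed precomputed."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)

# e.g. five sampled chains: "12" wins despite "15" having more raw votes,
# because the verifier scored the "12" chains as more reliable.
print(verifier_weighted_vote([("15", 0.2), ("15", 0.3), ("15", 0.1), ("12", 0.9), ("12", 0.8)]))
```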
arXiv Detail & Related papers (2022-06-06T03:38:36Z)