LLMs can implicitly learn from mistakes in-context
- URL: http://arxiv.org/abs/2502.08550v1
- Date: Wed, 12 Feb 2025 16:31:21 GMT
- Title: LLMs can implicitly learn from mistakes in-context
- Authors: Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Yi Chern Tan, Marek Rei, Max Bartolo
- Abstract summary: We investigate whether Large Language Models (LLMs) can learn from mistakes in mathematical reasoning tasks when explanations are not provided.
Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context.
This approach also substantially outperforms chain-of-thought prompting in our evaluations.
- Score: 15.818061010632249
- Abstract: Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored equally as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
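The core prompting idea in the abstract, showing incorrect answers alongside correct ones with no rationale in between, can be illustrated with a minimal sketch. The template below is an assumed format for illustration only, not the authors' exact prompt; the exemplar questions are invented.

```python
# Sketch of the paper's core idea (assumed prompt format, not the
# authors' exact template): each exemplar question is shown with an
# incorrect answer and the correct answer, with no explanation of
# why the incorrect answer is wrong.

def build_mistake_aware_prompt(exemplars, target_question):
    """Build a few-shot prompt from (question, wrong_answer,
    correct_answer) triples, omitting any rationale."""
    parts = []
    for question, wrong, correct in exemplars:
        parts.append(
            f"Question: {question}\n"
            f"Incorrect answer: {wrong}\n"
            f"Correct answer: {correct}\n"
        )
    parts.append(f"Question: {target_question}\nCorrect answer:")
    return "\n".join(parts)

# Invented exemplars for demonstration.
exemplars = [
    ("What is 12 * 7?", "74", "84"),
    ("A train travels 60 km in 1.5 hours. What is its speed in km/h?",
     "90", "40"),
]
prompt = build_mistake_aware_prompt(exemplars, "What is 15% of 200?")
print(prompt)
```

Per the abstract, prompting in this style outperformed both rationale-augmented variants and chain-of-thought prompting in the authors' evaluations; the model is left to infer implicitly why the incorrect answers fail.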
Related papers
- Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology [13.964263002704582]
We show that, even with the use of Chains of Thought prompts, mainstream LLMs have a high error rate when solving modified CRT problems.
Specifically, the average accuracy rate dropped by up to 50% compared to the original questions.
This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans.
arXiv Detail & Related papers (2024-10-19T05:01:56Z)
- Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems presented together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
- When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation [66.01754585188739]
Large Language Models (LLMs) have been found to have difficulty recognizing when they lack certain knowledge.
Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations.
We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence.
arXiv Detail & Related papers (2024-02-18T04:57:19Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- LLMs cannot find reasoning errors, but can correct them given the error location [0.9017736137562115]
Poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than from any inability to correct a known mistake.
We benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task.
We show that it is possible to obtain mistake location information without ground truth labels or in-domain training data.
arXiv Detail & Related papers (2023-11-14T20:12:38Z)
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) have recently exhibited remarkable reasoning capabilities in solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
- Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong [35.64962031447787]
Large Language Models (LLMs) are increasingly used for accessing information on the web.
Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking.
Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy.
arXiv Detail & Related papers (2023-10-19T08:09:58Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain-of-thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.