Related papers: Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

URL: http://arxiv.org/abs/2408.16293v1
Date: Thu, 29 Aug 2024 06:49:20 GMT
Title: Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
Authors: Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu,
Abstract summary: We focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. We show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy.
Score: 47.753284211200665
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.

Related papers

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks [24.55929874173401]
We show that a language model's reasoning capabilities can be improved by training on datasets of chain-of-thought traces from more capable models.<n>Experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets.
arXiv Detail & Related papers (2025-12-24T07:35:55Z)
Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression [0.5831737970661137]
We provide a method to test training data for flaws, to establish a trustworthy ground-truth for a subsequent training of machine learning models. Our approach extends the abilities of conventional statistical testing by letting the test-condition'' be any condition to describe a pattern in the data. We provide an open source implementation for demonstration and experiments.
arXiv Detail & Related papers (2025-03-24T09:52:36Z)
Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction [6.426690600216749]
Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods rely on fine-tuning or adversarial training to correct errors. We propose a self-correct adversarial training framework for textbfLearntextbfIng from textbfMIstextbfTakes.
arXiv Detail & Related papers (2024-12-23T04:58:58Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE) RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth. We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
Annotating and Modeling Fine-grained Factuality in Summarization [36.88018450067003]
A major barrier to their use in practice is their propensity to output summaries that are not faithful to the input and that contain factual errors. We explore both synthetic and human-labeled data sources for training models to identify factual errors in summarization. We show that our best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data.
arXiv Detail & Related papers (2021-04-09T11:20:44Z)
On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation [38.10429793534442]
We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. We use it to create a set of English language error detection and correction datasets.
arXiv Detail & Related papers (2020-05-03T18:08:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.