DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
- URL: http://arxiv.org/abs/2407.13690v2
- Date: Mon, 23 Dec 2024 17:32:21 GMT
- Title: DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
- Authors: Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He
- Abstract summary: We propose Difficulty-Aware Rejection Tuning (DART), a method that allocates more synthesis trials to difficult queries, enabling more extensive training on difficult samples.
We fine-tune base models ranging from 7B to 70B in size on our datasets, resulting in a series of strong models called DART-MATH.
- Score: 15.815363023014248
- License:
- Abstract: Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for large language models. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial for learning complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates more trials to difficult queries during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process relies solely on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models, ranging from 7B to 70B in size, on our datasets, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH significantly outperforms vanilla rejection tuning and is superior or comparable to prior art, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.
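The central mechanism described in the abstract is the sampling budget: instead of giving every query the same number of synthesis attempts, DART spends more attempts on queries the model finds hard. The sketch below is a minimal Python illustration of that idea under stated assumptions; the function name `dart_synthesize`, the pilot-run difficulty estimate, and the linear budget schedule are illustrative choices, not the authors' exact procedure.

```python
from typing import Callable, Dict, List


def dart_synthesize(
    queries: List[Dict[str, str]],            # each item: {"question": ..., "answer": ...}
    generate: Callable[[str], str],           # sampler backed by an open-weight 7B model (assumed interface)
    is_correct: Callable[[str, str], bool],   # checks a sampled solution against the reference answer
    base_trials: int = 16,
    max_trials: int = 256,
) -> List[Dict[str, str]]:
    """Sketch of difficulty-aware rejection sampling (not the authors' exact code).

    Vanilla rejection tuning gives every query the same number of sampling
    trials, so the hardest queries often contribute zero correct solutions.
    Here a small pilot run estimates difficulty as the failure rate, and the
    trial budget is scaled up for harder queries.
    """
    dataset: List[Dict[str, str]] = []
    for q in queries:
        # Pilot run: estimate difficulty as the fraction of failed attempts.
        pilot = [generate(q["question"]) for _ in range(base_trials)]
        fail_rate = sum(not is_correct(s, q["answer"]) for s in pilot) / base_trials

        # Allocate extra trials in proportion to estimated difficulty
        # (the linear schedule here is an illustrative assumption).
        budget = min(max_trials, int(base_trials * (1 + 10 * fail_rate)))

        correct = [s for s in pilot if is_correct(s, q["answer"])]
        for _ in range(budget - base_trials):
            s = generate(q["question"])
            if is_correct(s, q["answer"]):
                correct.append(s)

        dataset.extend({"question": q["question"], "response": s} for s in correct)
    return dataset
```

The contrast with vanilla rejection tuning lies entirely in the per-query budget: a uniform budget concentrates the resulting dataset on easy queries, whereas scaling the budget with estimated difficulty recovers correct solutions for queries that would otherwise be dropped.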
Related papers
- OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling [9.617742955894247]
Lack of high-quality optimization modeling datasets hampers large language models.
We propose a scalable framework for synthesizing a high-quality dataset, named OptMATH.
We demonstrate that models of various sizes trained on OptMATH achieve superior results on multiple modeling benchmarks.
arXiv Detail & Related papers (2025-02-16T12:38:37Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages [13.377908992869814]
Problem-solving data significantly enhances the model's mathematical capabilities compared to general mathematical corpora.
We identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance.
arXiv Detail & Related papers (2025-01-23T12:14:57Z)
- OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data [8.36384597713879]
The OpenMathInstruct-2 dataset consists of 14M question-solution pairs (≈ 600K unique questions).
Finetuning the Llama-3.1-8B-Base model using OpenMathInstruct-2 outperforms Llama-3.1-8B-Instruct on MATH by an absolute 15.9%.
To accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
arXiv Detail & Related papers (2024-10-02T14:00:09Z)
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of 6 benchmark datasets spanning various domains.
Each problem within these datasets is annotated with numerical difficulty scores.
We provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple training methods such as in-context learning, linear classifier heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
- Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)