Exploring the Mystery of Influential Data for Mathematical Reasoning
- URL: http://arxiv.org/abs/2404.01067v2
- Date: Sat, 7 Sep 2024 06:03:06 GMT
- Title: Exploring the Mystery of Influential Data for Mathematical Reasoning
- Authors: Xinzhe Ni, Yeyun Gong, Zhibin Gou, Yelong Shen, Yujiu Yang, Nan Duan, Weizhu Chen
- Abstract summary: We propose a Quality-aware Diverse Selection (QaDS) strategy for mathematical reasoning.
A comparison with other selection strategies validates the superiority of QaDS.
With OpenMathMix, we achieve a state-of-the-art 48.8% accuracy on MATH with a 7B base model.
- Score: 127.61978092016228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting influential data for fine-tuning on downstream tasks is a key factor for both performance and computation efficiency. Recent works have shown that training with only limited data can yield superior performance on general tasks. However, the feasibility on mathematical reasoning tasks has not been validated. To go further, there exist two open questions for mathematical reasoning: how to select influential data and what is an influential data composition. For the former, we propose a Quality-aware Diverse Selection (QaDS) strategy adaptable for mathematical reasoning. A comparison with other selection strategies validates the superiority of QaDS. For the latter, we first enlarge our setting and explore the influential data composition. We conduct a series of experiments and highlight two findings: scaling up reasoning data is helpful, and so is training with general data selected by QaDS. Then, we define our optimal mixture as OpenMathMix, an influential data mixture with open-source data selected by QaDS. With OpenMathMix, we achieve a state-of-the-art 48.8% accuracy on MATH with a 7B base model. Additionally, we showcase the use of QaDS in creating efficient fine-tuning mixtures with various selection ratios, and analyze the quality of a wide range of open-source datasets, which can serve as a reference for future works on mathematical reasoning tasks.
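The abstract names two selection criteria, quality and diversity, without giving the procedure. As a rough illustration only, the sketch below ranks samples by a quality score and then greedily skips any sample too similar to one already chosen. This is not the paper's QaDS algorithm: the function name `select_quality_then_diverse`, the `max_similarity` threshold, the cosine-similarity test, and the toy scores and embeddings are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical sample type: (sample id, quality score, embedding vector).
Sample = Tuple[str, float, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_quality_then_diverse(
    samples: List[Sample], budget: int, max_similarity: float = 0.9
) -> List[str]:
    """Pick up to `budget` sample ids, best quality first, skipping near-duplicates.

    Assumption: quality scores and embeddings are already computed elsewhere.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    chosen: List[Sample] = []
    for sample in ranked:
        if len(chosen) >= budget:
            break
        # Keep the sample only if it is not too close to anything already kept.
        if all(cosine(sample[2], kept[2]) <= max_similarity for kept in chosen):
            chosen.append(sample)
    return [sample_id for sample_id, _, _ in chosen]


# Toy data (made up): "b" nearly duplicates "a", and "d" duplicates "c".
toy = [
    ("a", 0.9, [1.0, 0.0]),
    ("b", 0.8, [1.0, 0.01]),
    ("c", 0.5, [0.0, 1.0]),
    ("d", 0.4, [0.0, 1.0]),
]
print(select_quality_then_diverse(toy, budget=2))
```

With the toy data, the lower-quality but dissimilar sample "c" is preferred over the near-duplicate "b", which is the trade-off a quality-plus-diversity criterion is meant to capture.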
Related papers
- Compute-Constrained Data Selection [77.06528009072967]
We formalize the problem of data selection with a cost-aware utility function, and model the problem as trading off initial-selection cost for training gain.
We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute.
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
- Curriculum Learning with Quality-Driven Data Selection [6.045582958441303]
OpenAI's GPT-4 has generated significant interest in the development of Multimodal Large Language Models (MLLMs).
We propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2024-06-27T07:20:36Z)
- MEL: Efficient Multi-Task Evolutionary Learning for High-Dimensional Feature Selection [11.934379476825551]
We propose a novel approach called PSO-based Multi-task Evolutionary Learning (MEL).
By incorporating information sharing between different feature selection tasks, MEL achieves enhanced learning ability and efficiency.
We evaluate the effectiveness of MEL through extensive experiments on 22 high-dimensional datasets.
arXiv Detail & Related papers (2024-02-14T06:51:49Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- LoBaSS: Gauging Learnability in Supervised Fine-tuning Data [64.27898739929734]
Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large Language Models (LLMs) to specific task prerequisites.
We introduce a new dimension in SFT data selection: learnability.
We present the Loss Based SFT Data Selection (LoBaSS) method, utilizing data learnability as the principal criterion for selecting SFT data.
arXiv Detail & Related papers (2023-10-16T07:26:24Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)