More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
- URL: http://arxiv.org/abs/2510.07169v1
- Date: Wed, 08 Oct 2025 16:07:26 GMT
- Title: More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
- Authors: Yike Zhao, Simin Guo, Ziqing Yang, Shifan Han, Dahua Lin, Fei Tan
- Abstract summary: We conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume.
- Score: 47.13636836547429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.
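To make the "better data" finding concrete, the sketch below illustrates distillation-style curation in miniature: a stronger teacher model generates candidate solutions, and only verifiably correct ones are kept for training. All names (`distill_dataset`, `stub_teacher`) are illustrative, not from the paper, and the stub teacher merely stands in for an LLM call so the sketch runs.

```python
# Minimal, hypothetical sketch of distillation-with-verification for
# curating math training data: keep teacher solutions only when the
# final answer passes a check, rather than keeping everything.

def distill_dataset(problems, teacher, check_answer, samples_per_problem=4):
    """Collect teacher solutions whose final answer passes verification."""
    curated = []
    for problem, gold in problems:
        for _ in range(samples_per_problem):
            solution, answer = teacher(problem)
            if check_answer(answer, gold):   # filter for quality, don't just scale
                curated.append({"problem": problem, "solution": solution})
                break                        # one verified trace per problem
    return curated

# Stub teacher: toy arithmetic only; a real pipeline would query a strong LLM.
def stub_teacher(problem):
    answer = eval(problem)
    return f"Compute {problem} step by step; the result is {answer}.", answer

problems = [("2+3", 5), ("4*7", 28), ("10-1", 9)]
data = distill_dataset(problems, stub_teacher, lambda a, g: a == g)
print(len(data))  # → 3
```

In a real pipeline the verification step (here an exact-match lambda) would be an answer extractor plus numeric or symbolic equivalence check.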
Related papers
- Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution. We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z)
- A Survey on Efficient Large Language Model Training: From Data-centric Perspectives [42.897899343082806]
We present the first systematic survey of data-efficient Large Language Model post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training.
arXiv Detail & Related papers (2025-10-29T17:01:55Z)
- Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity [59.27594125465172]
We introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples. We then introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data.
arXiv Detail & Related papers (2025-09-29T14:20:04Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Data Efficacy for Language Model Training [29.901090317084005]
Data is fundamental to the training of language models (LMs). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. This work introduces a general paradigm, DELT, for considering data efficacy in LM training.
arXiv Detail & Related papers (2025-06-26T17:59:07Z)
- Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study [55.09905978813599]
Large Language Models (LLMs) hold promise in automating data analysis tasks. Yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs.
arXiv Detail & Related papers (2025-06-24T17:04:23Z)
- Data Assetization via Resources-decoupled Federated Learning [7.347554648348435]
Federated learning (FL) provides an effective approach to collaboratively training models while preserving privacy. We first propose a framework for resource-decoupled FL involving three parties. Next, we propose the Quality-aware Dynamic Resources-decoupled FL algorithm (QD-RDFL).
arXiv Detail & Related papers (2025-01-24T15:49:04Z)
- Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification [7.357494019212501]
We propose efficient weighted-loss approaches to align synthetic data with the real-world distribution. We empirically assessed the effectiveness of our method on multiple text classification tasks.
arXiv Detail & Related papers (2024-10-28T20:53:49Z)
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
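The gradient-similarity idea behind LESS can be sketched in a few lines. This is a toy illustration, not the authors' implementation: random vectors stand in for the low-rank projected per-example gradients, and the dimensions and 5% threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in LESS, these would be low-rank random projections of
# per-example training gradients and gradients from a few target-task examples.
d = 16                                     # projected gradient dimension (illustrative)
train_grads = rng.normal(size=(200, d))    # 200 candidate training examples
target_grads = rng.normal(size=(5, d))     # few-shot target-task examples

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Score each candidate by its best similarity to any target-task gradient,
# then keep the top 5% -- mirroring the "LESS-selected 5%" setting above.
scores = cosine_sim(train_grads, target_grads).max(axis=1)
k = max(1, int(0.05 * len(train_grads)))
selected = np.argsort(scores)[::-1][:k]
print(sorted(selected.tolist()))
```

The selected subset is then used for instruction tuning in place of the full dataset; the point of the abstract above is that this 5% can outperform training on everything.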
This list is automatically generated from the titles and abstracts of the papers on this site.