Beyond Scale: the Diversity Coefficient as a Data Quality Metric
Demonstrates LLMs are Pre-trained on Formally Diverse Data
- URL: http://arxiv.org/abs/2306.13840v2
- Date: Tue, 26 Sep 2023 23:29:05 GMT
- Title: Beyond Scale: the Diversity Coefficient as a Data Quality Metric
Demonstrates LLMs are Pre-trained on Formally Diverse Data
- Authors: Alycia Lee, Brando Miranda, Sudharsan Sundar, Sanmi Koyejo
- Abstract summary: We use the recently proposed Task2Vec diversity coefficient to ground and understand formal aspects of data quality.
Specifically, we measure the diversity coefficient of publicly available pre-training datasets to demonstrate that their formal diversity is high.
We conclude the diversity coefficient is reliable, show it's high for publicly available LLM datasets, and conjecture it can be used to build useful diverse datasets for LLMs.
- Score: 12.76278784443243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current trends to pre-train capable Large Language Models (LLMs) mostly focus
on scaling of model and dataset size. However, the quality of pre-training data
is an important factor for training powerful LLMs, yet it is a nebulous concept
that has not been fully characterized. Therefore, we use the recently proposed
Task2Vec diversity coefficient to ground and understand formal aspects of data
quality, to go beyond scale alone. Specifically, we measure the diversity
coefficient of publicly available pre-training datasets to demonstrate that
their formal diversity is high when compared to theoretical lower and upper
bounds. In addition, to build confidence in the diversity coefficient, we
conduct interpretability experiments and find that the coefficient aligns with
intuitive properties of diversity, e.g., it increases as the number of latent
concepts increases. We conclude the diversity coefficient is reliable, show
it's high for publicly available LLM datasets, and conjecture it can be used to
build useful diverse datasets for LLMs.
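As an illustration of the metric named in the abstract: the Task2Vec diversity coefficient is, to our understanding, the expected cosine distance between Task2Vec embeddings of batches sampled from a dataset. The sketch below assumes those embeddings (one vector per batch) have already been computed by a probe network; it does not reproduce that step. The function names and toy vectors are hypothetical and are not taken from the authors' code.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def diversity_coefficient(batch_embeddings):
    """Average pairwise cosine distance over all distinct pairs of embeddings.

    Assumption: each element of `batch_embeddings` is a precomputed
    Task2Vec-style embedding of one sampled batch of text.
    """
    pairs = list(combinations(batch_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two batch embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (hypothetical): identical batches give 0, orthogonal ones give 1.
identical = [[1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
print(diversity_coefficient(identical))
print(diversity_coefficient(orthogonal))
```

Under this reading, a dataset whose batches all yield near-identical embeddings scores close to the lower bound, which matches the abstract's comparison against theoretical lower and upper bounds.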
Related papers
- IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment [29.703775936837012]
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. We introduce an innovative data equilibrium framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets.
arXiv Detail & Related papers (2025-05-19T06:42:44Z)
- Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds.
Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models.
These findings have important implications for applications that require diverse yet high-quality outputs.
arXiv Detail & Related papers (2025-04-16T23:02:23Z)
- Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric [48.81957145701228]
We propose a new diversity metric based on sample-level "novelty".
We show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance.
arXiv Detail & Related papers (2025-02-24T14:20:22Z)
- Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models.
Our study shows positive correlations between test set accuracy and data diversity.
These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance.
arXiv Detail & Related papers (2025-01-15T00:56:59Z)
- Diversity Over Quantity: A Lesson From Few Shot Relation Classification [62.66895901654023]
We show that training on a diverse set of relations significantly enhances a model's ability to generalize to unseen relations.
We introduce REBEL-FS, a new FSRC benchmark that incorporates an order of magnitude more relation types than existing datasets.
arXiv Detail & Related papers (2024-12-06T21:41:01Z)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
- Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z)
- G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation [21.506844286376275]
We propose a novel gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation.
Our key innovation centers around analyzing how individual training examples influence the model during training.
arXiv Detail & Related papers (2024-05-21T16:38:13Z)
- LMD3: Language Model Data Density Dependence [78.76731603461832]
We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation.
Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density.
We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data.
arXiv Detail & Related papers (2024-05-10T09:03:27Z)
- On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
- Role of Structural and Conformational Diversity for Machine Learning Potentials [4.608732256350959]
We investigate the relationship between data biases and model generalization in Quantum Mechanics.
Our results reveal nuanced patterns in generalization metrics.
These findings provide valuable insights and guidelines for QM data generation efforts.
arXiv Detail & Related papers (2023-10-30T19:33:12Z)
- On the Connection between Pre-training Data Diversity and Fine-tuning Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training [72.8087629914444]
We study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset.
With the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity.
arXiv Detail & Related papers (2023-05-20T16:23:50Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)