ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws
- URL: http://arxiv.org/abs/2408.08310v1
- Date: Thu, 15 Aug 2024 17:59:30 GMT
- Title: ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws
- Authors: Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai Yu, Han Hu, Houwen Peng
- Abstract summary: ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations.
- Score: 67.59263833387536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of a reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. Through training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter improves the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and that ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
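The abstract only names the method, so here is a minimal sketch of the perplexity-difference idea in Python. The checkpoints, the exact score definition, and the filtering threshold are all placeholder assumptions for illustration; ScalingFilter trains its own pair of meta-models on the same data and defines the score precisely in the paper.

```python
# Minimal sketch of the perplexity-difference idea behind ScalingFilter.
# Assumptions (not from the paper): the checkpoints, the exact score
# definition, and any filtering threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_perplexity(model, tokenizer, text: str) -> float:
    """Average negative log-likelihood per token, i.e. log(perplexity)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()  # mean cross-entropy == log-perplexity

# Two models of different sizes, assumed trained on the same data
# (placeholder checkpoints; the paper trains dedicated meta-models).
tok = AutoTokenizer.from_pretrained("gpt2")
small = AutoModelForCausalLM.from_pretrained("gpt2").eval()
large = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

def quality_score(text: str) -> float:
    # Higher when the larger model explains the text much better than the
    # smaller one -- one plausible reading of the perplexity difference.
    return log_perplexity(small, tok, text) - log_perplexity(large, tok, text)

docs = ["A well-written paragraph about scaling laws.", "asdf qwerty zxcv"]
ranked = sorted(docs, key=quality_score, reverse=True)  # keep a top fraction
```

Ranking by this score needs no reference corpus, which is the point the abstract makes: the only supervision comes from how differently the two model scales fit the same text.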
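The semantic-diversity metric is likewise only named here. One plausible way to operationalize it with a text embedding model is mean pairwise cosine distance over document embeddings; the embedding model and the exact formula below are assumptions for illustration, not the paper's definition.

```python
# One plausible semantic-diversity score from text embeddings
# (the paper's exact formulation may differ). Assumes >= 2 texts.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def semantic_diversity(texts: list[str]) -> float:
    emb = encoder.encode(texts, normalize_embeddings=True)  # unit vectors
    sims = emb @ emb.T                                      # cosine similarities
    n = len(texts)
    off_diag = sims[~np.eye(n, dtype=bool)]                 # drop self-similarity
    return float(1.0 - off_diag.mean())  # mean pairwise cosine distance
```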
Related papers
- The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph [45.51085356985464]
We introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams.
This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity.
GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape.
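As a rough sketch of that iterative loop (not the authors' code; the priority function here is simply the number of not-yet-covered n-grams, and the n-gram order is a placeholder):

```python
# Greedy n-gram-coverage selection over a sentence -> n-gram bipartite graph.
# The real GraphFilter priority may differ; this is the simplest instance.
def ngrams(sentence: str, n: int = 2) -> set[tuple[str, ...]]:
    toks = sentence.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def graph_filter(sentences: list[str], k: int) -> list[str]:
    grams = {s: ngrams(s) for s in sentences}   # edges of the bipartite graph
    covered: set[tuple[str, ...]] = set()
    selected: list[str] = []
    pool = set(sentences)
    for _ in range(min(k, len(sentences))):
        # Priority = how many n-grams the sentence would newly cover.
        best = max(pool, key=lambda s: len(grams[s] - covered))
        if not grams[best] - covered:
            break  # nothing new left to cover
        selected.append(best)
        covered |= grams[best]  # "remove" covered n-grams from the graph
        pool.remove(best)
    return selected
```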
arXiv Detail & Related papers (2024-10-16T11:16:34Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
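A hedged illustration of what such a unified supervision format might look like; every field name and helper below is invented for this sketch, not taken from the paper.

```python
# Illustrative only: normalizing heterogeneous feedback (star ratings,
# pairwise comparisons) into one record format usable for SFT or
# preference-based training. All names are assumptions.
from dataclasses import dataclass

@dataclass
class UnifiedExample:
    prompt: str
    chosen: str            # preferred / reference response
    rejected: str | None   # present only for pairwise-preference data
    score: float           # scalar quality signal in [0, 1]

def from_rating(prompt: str, response: str, stars: int, max_stars: int = 5):
    return UnifiedExample(prompt, response, None, stars / max_stars)

def from_comparison(prompt: str, winner: str, loser: str):
    return UnifiedExample(prompt, winner, loser, 1.0)

dataset = [
    from_rating("Summarize the article.", "A concise summary ...", 4),
    from_comparison("Explain scaling laws.", "Clear answer ...", "Muddled answer ..."),
]
# A high-quality, diverse subset could then be extracted from `dataset`,
# e.g. by score thresholding plus a diversity criterion.
```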
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
arXiv Detail & Related papers (2023-11-08T08:47:49Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Tradeoffs in Resampling and Filtering for Imbalanced Classification [2.3605348648054454]
We show that different methods of selecting training data bring tradeoffs in effectiveness and efficiency.
We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model performance as selecting training data.
arXiv Detail & Related papers (2022-08-31T21:40:47Z)
- An Empirical Exploration in Quality Filtering of Text Data [0.0]
We find that aggressive filtering can lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model.
We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective.
arXiv Detail & Related papers (2021-09-02T04:02:51Z)
- Mitigating harm in language models with conditional-likelihood filtration [4.002298833349518]
We present a methodology for identifying harmful views in web-scale unfiltered datasets.
We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text.
We also discuss how trigger phrases reflecting specific values can be used by researchers to build language models that are more closely aligned with those values.
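One plausible reading of conditional-likelihood filtration is to compare a document's likelihood with and without a value-laden trigger phrase prepended; the trigger text, checkpoint, and threshold below are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: score how much more likely a document becomes when
# conditioned on a trigger phrase, then filter on that score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def doc_nll(text: str, prefix: str = "") -> float:
    """Mean NLL of `text` tokens, optionally conditioned on `prefix`."""
    ids = tok(prefix + text, return_tensors="pt")["input_ids"]
    n_prefix = len(tok(prefix)["input_ids"])
    labels = ids.clone()
    labels[:, :n_prefix] = -100  # score only the document tokens
    with torch.no_grad():
        return lm(ids, labels=labels).loss.item()

TRIGGER = "The following text expresses harmful views:\n"  # illustrative

def filtration_score(text: str) -> float:
    # Positive when the trigger makes the document *more* likely.
    return doc_nll(text) - doc_nll(text, prefix=TRIGGER)

corpus = ["Some document ...", "Another document ..."]
kept = [d for d in corpus if filtration_score(d) < 0.0]  # threshold is a guess
```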
arXiv Detail & Related papers (2021-08-04T22:18:10Z)
- Deep Variational Models for Collaborative Filtering-based Recommender Systems [63.995130144110156]
Deep learning provides accurate collaborative filtering models to improve recommender system results.
Our proposed models apply the variational concept to inject stochasticity into the latent space of the deep architecture.
Results show the superiority of the proposed approach in scenarios where the variational enrichment exceeds the injected noise effect.
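A minimal sketch of injecting variational stochasticity into a collaborative-filtering latent space via the standard reparameterization trick; the architecture, sizes, and names are placeholders, not the paper's proposed models.

```python
# Sketch: Gaussian reparameterization in the latent space of a simple
# user/item embedding model (all dimensions are illustrative).
import torch
import torch.nn as nn

class VariationalCF(nn.Module):
    def __init__(self, n_users: int, n_items: int, dim: int = 32):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)
        self.mu = nn.Linear(2 * dim, dim)
        self.logvar = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, u, i):
        h = torch.cat([self.user(u), self.item(i)], dim=-1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.out(z).squeeze(-1)  # predicted interaction score

model = VariationalCF(n_users=100, n_items=500)
pred = model(torch.tensor([3]), torch.tensor([42]))
```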
arXiv Detail & Related papers (2021-07-27T08:59:39Z)
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)