Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
- URL: http://arxiv.org/abs/2509.18577v2
- Date: Mon, 29 Sep 2025 02:13:38 GMT
- Title: Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity
- Authors: Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo
- Abstract summary: We propose a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics.
Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL.
Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks.
- Score: 16.521507516831097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
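A minimal sketch of the described filter in Python, for intuition only: the use of log-priors and the concrete `mean_range`/`max_std` cutoffs below are illustrative assumptions, not values from the paper.

```python
from collections import Counter
import math

def build_token_priors(corpus_tokens):
    """Estimate token priors from corpus-level term frequencies."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def doc_prior_stats(doc_tokens, priors, floor=1e-9):
    """Mean and standard deviation of (log) token priors for one document."""
    logs = [math.log(priors.get(t, floor)) for t in doc_tokens]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, var ** 0.5

def keep_document(doc_tokens, priors, mean_range=(-12.0, -4.0), max_std=4.0):
    """Filter rule: keep documents whose prior statistics fall in a plausible band.
    The band edges here are hypothetical, chosen only to show the mechanism."""
    if not doc_tokens:
        return False
    mean, std = doc_prior_stats(doc_tokens, priors)
    return mean_range[0] <= mean <= mean_range[1] and std <= max_std
```

Because scoring needs only a single counting pass over the corpus, no model inference is involved, which is where the reported >1000x speedup over PPL-based filtering comes from.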
Related papers
- Kostant relation in filtered randomized benchmarking for passive bosonic devices [0.0]
We introduce a filter function using immanants.
We argue that weak coherent states and intensity measurements are sufficient to proceed with the characterization.
arXiv Detail & Related papers (2025-11-02T07:53:24Z)
- Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only [70.43369087819332]
Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models with human-annotated demonstrations.
We propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance.
arXiv Detail & Related papers (2025-10-24T02:02:13Z)
- Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning [49.04912820721943]
Supervised fine-tuning (SFT) is computationally expensive and sometimes suffers from overfitting or bias amplification.
This work studies the online batch selection family that dynamically scores and filters samples during the training process.
We develop UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT.
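The summary does not spell out UDS's scoring rule, so the sketch below is a hypothetical utility-diversity combination: per-sample loss as utility, cosine distance from the candidate pool's centroid as diversity, and an assumed mixing weight `alpha`.

```python
import torch

def utility_diversity_select(losses, embeddings, k, alpha=0.5):
    """Hypothetical online batch selection: score candidates by a mix of
    training utility (normalized loss) and diversity (cosine distance from
    the pool centroid), then keep the top-k. Not UDS's actual formula."""
    emb = torch.nn.functional.normalize(embeddings, dim=-1)
    centroid = emb.mean(dim=0, keepdim=True)
    diversity = 1.0 - (emb @ centroid.T).squeeze(-1)  # cosine distance to centroid
    utility = (losses - losses.mean()) / (losses.std() + 1e-8)
    div_z = (diversity - diversity.mean()) / (diversity.std() + 1e-8)
    score = alpha * utility + (1 - alpha) * div_z
    return torch.topk(score, k).indices  # indices of samples to train on
```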
arXiv Detail & Related papers (2025-10-19T15:32:01Z)
- GLiClass: Generalist Lightweight Model for Sequence Classification Tasks [49.2639069781367]
We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks.
Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios.
arXiv Detail & Related papers (2025-08-11T06:22:25Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL).
We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.
We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.
As a model-free approach, STAND can be applied to any existing language model without additional training.
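A hedged sketch of the n-gram drafting idea behind model-free speculative decoding; STAND's stochastic and adaptive components are not detailed in the summary, so a plain most-frequent-continuation table stands in for them.

```python
from collections import defaultdict, Counter

class NgramDrafter:
    """Illustrative model-free drafter: propose continuations from an n-gram
    table built over tokens seen so far; the target model then verifies the
    drafted tokens, as in standard speculative decoding."""
    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)

    def update(self, tokens):
        # Record each length-n context and the token that followed it.
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i:i + self.n])
            self.table[ctx][tokens[i + self.n]] += 1

    def draft(self, tokens, k=4):
        # Greedily extend with the most frequent continuation, up to k tokens.
        out = list(tokens)
        for _ in range(k):
            ctx = tuple(out[-self.n:])
            if ctx not in self.table:
                break
            out.append(self.table[ctx].most_common(1)[0][0])
        return out[len(tokens):]  # drafted tokens, to be verified by the target model
```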
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models [1.6816171955882597]
PMPO locates low-quality prompt segments via a masking-based analysis and iteratively rewrites them to propose improved variants.
It selects among variants by minimizing loss in a single forward pass, eliminating output sampling and human- or judge-based scoring for selection.
Across model sizes and datasets, PMPO outperforms prior methods: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and raises AlpacaEval 2.0 win rates by over 19 points.
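An illustrative version of the loss-based selection step, assuming an HF-style causal LM; a faithful implementation would compute the loss only over the target tokens, whereas this sketch scores the whole sequence for brevity.

```python
import torch

@torch.no_grad()
def select_prompt_variant(model, tokenizer, variants, target_text, device="cpu"):
    """Illustrative selection: score each candidate prompt by the LM loss it
    induces on the desired output and keep the lowest-loss variant.
    Simplification: loss is taken over the full sequence, not just the target."""
    best, best_loss = None, float("inf")
    for prompt in variants:
        ids = tokenizer(prompt + target_text, return_tensors="pt").input_ids.to(device)
        loss = model(ids, labels=ids).loss.item()  # one forward pass per variant
        if loss < best_loss:
            best, best_loss = prompt, loss
    return best
```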
arXiv Detail & Related papers (2025-05-22T06:59:10Z)
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to measure the spread of semantic representations.
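A minimal sketch of the scoring rule implied by the summary; the sign convention (the smaller model being more surprised on high-quality text) and the threshold are assumptions.

```python
import math

def scaling_filter_score(ppl_small, ppl_large):
    """Quality proxy from the summary: the perplexity gap between a smaller
    and a larger model trained on the same data. Sign is an assumption:
    a larger gap is taken to mean higher quality."""
    return math.log(ppl_small) - math.log(ppl_large)

def filter_by_scaling(docs, ppls_small, ppls_large, threshold=0.0):
    """Keep documents whose quality score exceeds a hypothetical threshold."""
    return [doc for doc, ps, pl in zip(docs, ppls_small, ppls_large)
            if scaling_filter_score(ps, pl) > threshold]
```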
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low Computational Overhead [75.87007729801304]
SpaFL, a communication-efficient FL framework, is proposed to optimize sparse model structures with low computational overhead.
To optimize the pruning process itself, only thresholds are communicated between a server and clients instead of parameters.
Global thresholds are used to update model parameters by extracting aggregated parameter importance.
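A rough illustration of threshold-only communication; the per-filter importance measure and the simple averaging on the server side are assumptions, not SpaFL's actual rules.

```python
import torch

def apply_threshold(weights, threshold):
    """Client side: keep filters whose importance exceeds the shared threshold.
    Importance here is mean absolute weight per output filter (an assumption)."""
    importance = weights.abs().mean(dim=tuple(range(1, weights.dim())))
    mask = (importance > threshold).float()
    return weights * mask.view(-1, *([1] * (weights.dim() - 1)))

def server_update(client_thresholds):
    """Server side: only scalar thresholds travel between server and clients,
    not full parameter tensors; plain averaging is used here for illustration."""
    return sum(client_thresholds) / len(client_thresholds)
```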
arXiv Detail & Related papers (2024-06-01T13:10:35Z)
- Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning [43.10197671420528]
We study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model?
This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model.
Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks.
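A hedged sketch of weak-model filtering with an HF-style small model; plain response perplexity stands in here for the paper's actual selection score.

```python
import torch

@torch.no_grad()
def weak_model_scores(small_model, tokenizer, samples, device="cpu"):
    """Score instruction-tuning samples with a small, weak model; the actual
    Superfiltering criterion is approximated here by sequence perplexity."""
    scores = []
    for instruction, response in samples:
        ids = tokenizer(instruction + response, return_tensors="pt").input_ids.to(device)
        loss = small_model(ids, labels=ids).loss  # mean NLL over the sequence
        scores.append(loss.exp().item())          # perplexity as a difficulty proxy
    return scores  # rank by score and keep the slice that fits the data budget
```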
arXiv Detail & Related papers (2024-02-01T11:57:53Z)
- Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z)
- Dependency Aware Filter Pruning [74.69495455411987]
Pruning a proportion of unimportant filters is an efficient way to mitigate the inference cost.
Previous work prunes filters according to their weight norms or the corresponding batch-norm scaling factors.
We propose a novel mechanism to dynamically control the sparsity-inducing regularization so as to achieve the desired sparsity.
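A sketch of the weight-norm baseline criterion the summary mentions; the paper's dependency-aware, dynamically regularized mechanism is not reproduced here.

```python
import torch

def prune_filters_by_norm(conv_weight, keep_ratio=0.7):
    """Baseline criterion from prior work: rank conv filters by L1 weight norm
    and zero out the weakest ones. keep_ratio is a hypothetical budget."""
    norms = conv_weight.abs().sum(dim=(1, 2, 3))  # one norm per output filter
    k = max(1, int(keep_ratio * norms.numel()))
    keep = torch.topk(norms, k).indices
    mask = torch.zeros_like(norms)
    mask[keep] = 1.0
    return conv_weight * mask.view(-1, 1, 1, 1)
```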
arXiv Detail & Related papers (2020-05-06T07:41:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.