TREX: Tokenizer Regression for Optimal Data Mixture
- URL: http://arxiv.org/abs/2601.13588v1
- Date: Tue, 20 Jan 2026 04:41:09 GMT
- Title: TREX: Tokenizer Regression for Optimal Data Mixture
- Authors: Inho Won, Hangyeol Yoo, Minkyung Cho, Jungyeul Park, Hoyun Song, KyungTae Lim,
- Abstract summary: Tokenizer Regression for Optimal Data MiXture (TREX) is a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency.
- Score: 10.917621429052183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TREX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
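The pipeline described in the abstract (proxy tokenizers trained on random mixtures, compression statistics, a learned predictor, then a cheap mixture search) can be sketched in a few dozen lines. The following is a minimal illustration rather than the paper's implementation: the Dirichlet mixture sampler, the HuggingFace `tokenizers` BPE proxies, the bytes-per-token compression statistic, and the gradient-boosted regressor are all assumptions chosen for concreteness; the paper's proxy scale, metric, and predictor may differ.

```python
"""Minimal sketch of a TREX-style pipeline (illustrative only).

Assumptions not taken from the paper: Dirichlet sampling of mixtures,
HuggingFace `tokenizers` BPE proxies, bytes-per-token as the compression
statistic, and a gradient-boosted regressor as the predictor.
"""
import numpy as np
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from sklearn.ensemble import GradientBoostingRegressor


def sample_mixture(n_langs, rng):
    # One random point on the simplex: a candidate language mixture.
    return rng.dirichlet(np.ones(n_langs))


def train_proxy_tokenizer(corpora, mixture, vocab_size=8_000, budget=200_000):
    # corpora: list of per-language line lists. Draw lines proportionally
    # to the mixture and train a small byte-level BPE proxy tokenizer.
    lines = []
    for lang_lines, ratio in zip(corpora, mixture):
        lines.extend(lang_lines[: max(1, int(budget * ratio))])
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(lines, trainer)
    return tok


def compression(tok, eval_texts):
    # Bytes per token on held-out text: higher means better compression.
    n_bytes = sum(len(t.encode("utf-8")) for t in eval_texts)
    n_tokens = sum(len(tok.encode(t).ids) for t in eval_texts)
    return n_bytes / max(1, n_tokens)


def fit_mixture_regressor(corpora, eval_texts, n_proxies=32, seed=0):
    # Stage 1: train proxy tokenizers on random mixtures, record
    # (mixture, compression) pairs, and fit a regressor on them.
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_proxies):
        mix = sample_mixture(len(corpora), rng)
        X.append(mix)
        y.append(compression(train_proxy_tokenizer(corpora, mix), eval_texts))
    return GradientBoostingRegressor().fit(np.array(X), np.array(y))


def search_best_mixture(reg, n_langs, n_candidates=100_000, seed=1):
    # Stage 2: cheap search over the simplex using the learned predictor.
    rng = np.random.default_rng(seed)
    cands = rng.dirichlet(np.ones(n_langs), size=n_candidates)
    return cands[int(np.argmax(reg.predict(cands)))]
```

The mixture returned by `search_best_mixture` would then be used to assemble the corpus for the final, full-scale tokenizer, which is the step the regression is meant to avoid repeating during search.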
Related papers
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training [16.022416196267937]
We propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. We show that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost.
arXiv Detail & Related papers (2026-01-31T14:27:46Z) - MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z) - CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z) - Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models [24.396525123797073]
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture. We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
arXiv Detail & Related papers (2025-02-21T21:27:48Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data. We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective. We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by roughly 200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood. BiMix provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study how predictably model performance varies with mixture proportions by fitting function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama; a minimal fitting sketch follows this list.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training [42.142924806184425]
Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
arXiv Detail & Related papers (2023-12-14T12:02:24Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
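As a companion to the Data Mixing Laws entry above, here is a minimal sketch of the shared idea of fitting a function form over mixture proportions: fit a parametric law to a handful of observed (mixture, loss) pairs from small pilot runs, then search the simplex for the mixture with the lowest predicted loss. The exponential-in-proportions form and the SciPy-based fit are illustrative assumptions, not the exact formulation of any paper listed here.

```python
"""Illustrative fit of a data-mixing law (not any paper's exact form).

Assumed form: loss(r) = c + k * exp(t . r), with mixture proportions r
on the simplex, fitted to a few observed (mixture, loss) pairs.
"""
import numpy as np
from scipy.optimize import curve_fit


def mixing_law(R, c, k, *t):
    # R: (n_runs, n_domains) mixture proportions; c, k, t: law parameters.
    return c + k * np.exp(R @ np.array(t))


def fit_and_search(mixtures, losses, n_candidates=50_000, seed=0):
    n_domains = mixtures.shape[1]
    p0 = [1.0, 1.0] + [0.0] * n_domains  # initial guess for c, k, t
    params, _ = curve_fit(mixing_law, mixtures, losses, p0=p0, maxfev=20_000)
    # Search the simplex for the mixture with the lowest predicted loss.
    rng = np.random.default_rng(seed)
    cands = rng.dirichlet(np.ones(n_domains), size=n_candidates)
    preds = mixing_law(cands, *params)
    return cands[int(np.argmin(preds))], params


if __name__ == "__main__":
    # Toy usage with synthetic observations standing in for pilot runs.
    rng = np.random.default_rng(42)
    obs_mix = rng.dirichlet(np.ones(3), size=12)  # 12 pilot mixtures
    obs_loss = 2.0 + 0.5 * np.exp(obs_mix @ np.array([0.3, -0.4, 0.1]))
    best_mix, _ = fit_and_search(obs_mix, obs_loss)
    print("predicted-best mixture:", np.round(best_mix, 3))
```

In practice such a fitted law would be nested with step- and size-scaling laws before extrapolating to the target training budget, as the Data Mixing Laws entry describes.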
This list is automatically generated from the titles and abstracts of the papers in this site.