Olmix: A Framework for Data Mixing Throughout LM Development
- URL: http://arxiv.org/abs/2602.12237v1
- Date: Thu, 12 Feb 2026 18:16:05 GMT
- Title: Olmix: A Framework for Data Mixing Throughout LM Development
- Authors: Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, Kyle Lo
- Abstract summary: Olmix is a framework that addresses the problem of data mixing in training language models. Design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by a domain-set update.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
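To make the mixture reuse mechanism concrete, the following is a minimal Python sketch of the idea as stated in the abstract: ratios of domains untouched by a domain-set update are kept, and only the affected domains receive freshly computed ratios. The helper `optimize_ratios` and the renormalization scheme are illustrative assumptions, not Olmix's actual implementation.

```python
# Minimal sketch of "mixture reuse" as described in the Olmix abstract.
# Assumption: a mixture is a dict {domain: ratio} summing to 1, and
# `optimize_ratios` is a hypothetical stand-in for any mixing method
# that computes ratios over a set of domains within a probability budget.

def optimize_ratios(domains, budget):
    """Hypothetical placeholder: a real mixing method would fit these
    ratios (e.g., via proxy runs); here we just split the budget uniformly."""
    return {d: budget / len(domains) for d in domains}

def mixture_reuse(old_ratios, affected, new_domains):
    """Recompute ratios only for domains touched by a domain-set update.

    old_ratios:  previous mixture, {domain: ratio}, sums to 1
    affected:    domains whose data changed (added/removed/partitioned/revised)
    new_domains: the current domain set after the update
    """
    kept = {d: r for d, r in old_ratios.items()
            if d in new_domains and d not in affected}
    # Probability mass freed by removed/affected domains is reallocated
    # to the domains that need fresh ratios (our assumption on the split).
    free_mass = 1.0 - sum(kept.values())
    to_compute = [d for d in new_domains if d not in kept]
    recomputed = optimize_ratios(to_compute, free_mass) if to_compute else {}
    return {**kept, **recomputed}

# Example: the "code" domain is partitioned into two finer-grained domains.
old = {"web": 0.6, "code": 0.3, "math": 0.1}
new = mixture_reuse(old, affected={"code"},
                    new_domains={"web", "math", "code_py", "code_other"})
print(new)  # web and math keep their ratios; only the code splits are recomputed
```

Reusing the unaffected ratios is what lets a sequence of domain-set updates avoid most of the optimization work that recomputing the full mix from scratch would require.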
Related papers
- MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training [54.78779514101305]
MaD-Mix is a principled framework that derives multi-modal data mixtures for VLM training. MaD-Mix speeds up VLM training across diverse benchmarks. In complex tri-modal video-image-text scenarios, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture overhead.
arXiv Detail & Related papers (2026-02-08T03:07:36Z)
- FMIP: Joint Continuous-Integer Flow For Mixed-Integer Linear Programming [52.52020895303244]
Mixed-Integer Linear Programming (MILP) is a foundational tool for complex decision-making problems. We propose Joint Continuous-Integer Flow for Mixed-Integer Linear Programming (FMIP), the first generative framework that models the joint distribution of both integer and continuous variables of MILP solutions. FMIP is fully compatible with arbitrary backbone networks and various downstream solvers, making it well-suited for a broad range of real-world MILP applications.
arXiv Detail & Related papers (2025-07-31T10:03:30Z)
- Merge to Mix: Mixing Datasets via Model Merging [2.990932417718553]
Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
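As a rough, hypothetical illustration of the idea in this blurb: model merging can stand in for training on a candidate mixture by averaging the parameters of models fine-tuned on the individual datasets, so each mixture is scored by evaluating a merged model rather than by retraining. The sketch below assumes simple weighted parameter averaging; the paper's exact procedure may differ.

```python
import numpy as np

# Hypothetical sketch: weighted parameter averaging as a cheap surrogate
# for a model fine-tuned on the corresponding dataset mixture.
# Assumes all models share one architecture, stored as {name: array}.

def merge_state_dicts(state_dicts, weights):
    """Average parameters of per-dataset models with the given mixture weights."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

# Usage: models fine-tuned on datasets A and B; approximate a 70/30 mixture
# by merging, then evaluate the merged model instead of retraining.
sd_a = {"linear.weight": np.ones((2, 2))}
sd_b = {"linear.weight": np.zeros((2, 2))}
surrogate = merge_state_dicts([sd_a, sd_b], weights=[0.7, 0.3])
print(surrogate["linear.weight"])  # 0.7 in every entry
```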
arXiv Detail & Related papers (2025-05-21T22:34:13Z)
- BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood. BiMix provides a systematic framework for understanding and optimizing data mixtures. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- Mixture Weight Estimation and Model Prediction in Multi-source Multi-target Domain Adaptation [22.933419188759707]
We consider the problem of learning a model from multiple heterogeneous sources.
The goal of the learner is to mix these data sources in a target-distribution-aware way.
arXiv Detail & Related papers (2023-09-19T16:29:34Z)
- Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond [10.25372189905226]
LossMix is a simple yet versatile and effective regularization technique that enhances the performance and robustness of object detectors.
Empirical results on the PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently outperform state-of-the-art methods for detection.
arXiv Detail & Related papers (2023-03-18T06:13:30Z)
- A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability [29.40977854491399]
Data augmentation (DA) is indispensable in modern machine learning and deep neural networks.
This survey comprehensively reviews a crucial subset of DA techniques, namely Mix-based Data Augmentation (MixDA).
In contrast to traditional DA approaches that operate on single samples or entire datasets, MixDA stands out due to its effectiveness, simplicity, flexibility, computational efficiency, theoretical foundation, and broad applicability.
arXiv Detail & Related papers (2022-12-21T09:58:14Z)
- FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup (Zhang et al., 2018).
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best performing baseline by 6.5% on average in terms of test accuracy.
arXiv Detail & Related papers (2022-11-07T09:38:34Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup; a sketch of the underlying vanilla mixup operation follows this entry.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
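Since this entry (and FIXED above) builds on mixup, here is a minimal sketch of the vanilla mixup operation (Zhang et al., 2018). It is not DM's decoupled regularizer, which modifies the loss side and is not reproduced here.

```python
import numpy as np

# Vanilla mixup (Zhang et al., 2018): interpolate pairs of examples and
# their one-hot labels with a Beta-distributed coefficient, smoothing the
# decision boundary. DM's decoupled loss term is intentionally omitted.

def mixup_batch(x, y_onehot, alpha=0.2, seed=0):
    """Return a mixed batch: lam * (x, y) + (1 - lam) * shuffled (x, y)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)      # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))    # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

# Usage: a toy batch of 4 two-feature inputs over 3 classes.
x = np.arange(8, dtype=float).reshape(4, 2)
y = np.eye(3)[[0, 1, 2, 1]]
x_mix, y_mix, lam = mixup_batch(x, y)
print(round(lam, 3), x_mix.shape, y_mix.shape)
```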