The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs
- URL: http://arxiv.org/abs/2602.08351v1
- Date: Mon, 09 Feb 2026 07:33:40 GMT
- Title: The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs
- Authors: Zhiliang Chen, Alfred Wei Lun Leong, Shao Yong Ong, Apivich Hemachandram, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low
- Abstract summary: JoBS is an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization. We study JoBS's average regret and devise the optimal budget allocation to minimize regret.
- Score: 86.27977008139435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO entirely with the predictor, effectively amortizing the cost of full training runs. We study JoBS's average regret and devise the optimal budget allocation to minimize regret. JoBS outperforms existing multi-fidelity BO baselines, as well as data and model optimization approaches, across diverse LLM tasks under the same optimization budget.
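To make the two-phase idea concrete, below is a minimal sketch of a JoBS-style procedure, assuming a toy two-dimensional joint search space (a data-mixture weight and a model-size knob), a scikit-learn Gaussian process as the performance predictor, a 50/50 budget split, and a lower-confidence-bound acquisition. All of these choices, the toy loss surface, and the use of short probe runs in the second phase are illustrative assumptions, not the paper's actual predictor, search space, or regret-optimal allocation.

```python
# Illustrative sketch of a JoBS-style two-phase procedure (assumptions noted above).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def partial_run_loss(config, steps):
    """Stand-in for training an LLM for `steps` steps under a joint
    (data-mixture weight, model-size) configuration and returning its loss."""
    mix, size = config
    mismatch = (mix - 0.3 * size) ** 2        # toy data/model interaction term
    return 2.0 + mismatch + 1.0 / np.sqrt(steps) + 0.02 * rng.normal()

def features(config, early_loss, early_steps=100):
    """Scaling-law-inspired features: the joint configuration plus an
    early-training loss that hints at how the run will extrapolate."""
    mix, size = config
    return [mix, size, early_loss, early_loss - 1.0 / np.sqrt(early_steps)]

# Phase 1: spend part of the budget on probe runs to fit a performance predictor.
total_budget, predictor_share = 40, 0.5
n_probe = int(total_budget * predictor_share)
X, y = [], []
for cfg in rng.uniform([0.0, 0.0], [1.0, 1.0], size=(n_probe, 2)):
    early = partial_run_loss(cfg, steps=100)     # cheap: only a few training steps
    target = partial_run_loss(cfg, steps=5000)   # longer (still partial) proxy for full training
    X.append(features(cfg, early))
    y.append(target)
predictor = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
predictor.fit(np.array(X), np.array(y))

# Phase 2: search driven by the predictor, so no full training runs are needed.
# Simplified: candidates are drawn at random and scored with an LCB acquisition;
# a full BO loop would also update a surrogate between evaluations.
best_cfg, best_score = None, np.inf
for _ in range(total_budget - n_probe):
    cfg = rng.uniform([0.0, 0.0], [1.0, 1.0])
    early = partial_run_loss(cfg, steps=100)     # short probe per candidate
    mean, std = predictor.predict(np.array([features(cfg, early)]), return_std=True)
    lcb = mean[0] - 1.0 * std[0]                 # lower-confidence-bound acquisition
    if lcb < best_score:
        best_score, best_cfg = lcb, cfg

print("selected (data mixture, model size):", best_cfg, "predicted score:", best_score)
```

The point the sketch tries to convey is the budget split: the more expensive supervision is gathered only in the first phase, while the second phase explores new joint configurations using only short probes plus the learned predictor.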
Related papers
- MALBO: Optimizing LLM-Based Multi-Agent Teams via Multi-Objective Bayesian Optimization [0.0]
This thesis introduces MALBO, a systematic framework designed to automate the efficient composition of multi-agent AI teams. We formalize the assignment challenge as a multi-objective optimization problem, aiming to identify the Pareto front of configurations trading off task accuracy against inference cost. Our results demonstrate that the Bayesian optimization phase maintained average performance comparable to an initial random search while reducing the average configuration cost by over 45%.
arXiv Detail & Related papers (2025-11-14T18:01:08Z)
- BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining [28.32850393150554]
BLISS is a lightweight data selection method that operates entirely from scratch, without relying on any external pretrained oracle models. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. BLISS achieves a $1.7\times$ speedup in reaching the same performance as the state-of-the-art method.
arXiv Detail & Related papers (2025-10-07T15:42:33Z)
- Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization [37.54165341391688]
We introduce a novel problem: Sample Scheduling for DPO. We propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection.
arXiv Detail & Related papers (2025-06-08T10:26:09Z)
- Large Language Models are Demonstration Pre-Selectors for Themselves [57.101804269100185]
In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. FEw yet Essential Demonstration prE-selectoR (FEEDER) is a novel pre-selection framework that identifies a representative subset of demonstrations. FEEDER can reduce training data size by over 20% while maintaining performance.
arXiv Detail & Related papers (2025-06-06T12:29:03Z)
- Cost-Optimal Grouped-Query Attention for Long-Context Modeling [45.981681856747365]
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models. We analyze the relationship among context length, model size, GQA configuration, and model loss. We propose a recipe for deriving cost-optimal GQA configurations.
arXiv Detail & Related papers (2025-03-12T17:50:42Z) - Understanding the Performance and Estimating the Cost of LLM Fine-Tuning [9.751868268608675]
Fine-tuning adapts Large Language Models (LLMs) to specific tasks in a cost-effective manner.
In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance.
We also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud.
arXiv Detail & Related papers (2024-08-08T16:26:07Z) - AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [59.12061830645018]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales. We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z) - Value Augmented Sampling for Language Model Alignment and Personalization [39.070662999014836]
We present a new framework for reward optimization, Value Augmented Sampling (VAS).
VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function.
Our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time.
arXiv Detail & Related papers (2024-05-10T17:59:04Z) - CoLLiE: Collaborative Training of Large Language Models in an Efficient
Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model
Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning, without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Conservative Objective Models for Effective Offline Model-Based
Optimization [78.19085445065845]
Computational design problems arise in a number of settings, from synthetic biology to computer architectures.
We propose a method that learns a model of the objective function that lower bounds the actual value of the ground-truth objective on out-of-distribution inputs.
These conservative objective models (COMs) are simple to implement and outperform a number of existing methods on a wide range of offline model-based optimization (MBO) problems.
arXiv Detail & Related papers (2021-07-14T17:55:28Z)
- Bayesian Optimization for Selecting Efficient Machine Learning Models [53.202224677485525]
We present a unified Bayesian Optimization framework for jointly optimizing models for both prediction effectiveness and training efficiency.
Experiments on model selection for recommendation tasks indicate that models selected this way significantly improve model training efficiency.
arXiv Detail & Related papers (2020-08-02T02:56:30Z)