Optimal Budgeted Adaptation of Large Language Models
- URL: http://arxiv.org/abs/2602.00952v1
- Date: Sun, 01 Feb 2026 01:08:15 GMT
- Title: Optimal Budgeted Adaptation of Large Language Models
- Authors: Jing Wang, Jie Shen, Dean Foster, Zohar Karnin, Jeremy C Weiss,
- Abstract summary: Trade-off between labeled data availability and downstream accuracy remains a central challenge in large language models.<n>We propose a principled framework for emphbudget-aware supervised fine-tuning by casting LLM adaptation as a contextual Stackelberg game.
- Score: 8.207283283207138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The trade-off between labeled data availability and downstream accuracy remains a central challenge in fine-tuning large language models (LLMs). We propose a principled framework for \emph{budget-aware supervised fine-tuning} by casting LLM adaptation as a contextual Stackelberg game. In our formulation, the learner (leader) commits to a scoring policy and a label-querying strategy, while an adaptive environment (follower) selects challenging supervised alternatives in response. To explicitly address label efficiency, we incorporate a finite supervision budget directly into the learning objective. Our algorithm operates in the full-feedback regime and achieves $\tilde{O}(d\sqrt{T})$ regret under standard linear contextual assumptions. We extend the framework with a Largest-Latency-First (LLF) confidence gate that selectively queries labels, achieving a budget-aware regret bound of $\tilde{O}(\sqrt{dB} + c\sqrt{B})$ with $B=βT$.
Related papers
- $\
abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning.<n>InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model.<n>Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z) - Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback [0.8287206589886881]
We study the Logistic Contextual Slate Bandit problem.<n>A single binary reward, determined by a logistic model, is observed for the chosen slate.<n>We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal.
arXiv Detail & Related papers (2025-06-16T07:19:02Z) - Scalable Chain of Thoughts via Elastic Reasoning [61.75753924952059]
Elastic Reasoning is a novel framework for scalable chain of thoughts.<n>It separates reasoning into two phases--thinking and solution--with independently allocated budgets.<n>Our approach produces more concise and efficient reasoning even in unconstrained settings.
arXiv Detail & Related papers (2025-05-08T15:01:06Z) - Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models.<n>It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive.<n>We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - $\texttt{SEM-CTRL}$: Semantically Controlled Decoding [53.86639808659575]
$texttSEM-CTRL$ is a unified approach that enforces rich context-sensitive constraints and task- and instance-specific semantics directly on an LLM decoder.<n>texttSEM-CTRL$ allows small pre-trained LLMs to efficiently outperform larger variants and state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-03-03T18:33:46Z) - Sparse Linear Bandits with Blocking Constraints [22.01704171400845]
We investigate the high-dimensional sparse linear bandits problem in a data-poor regime.<n>We show novel offline statistical guarantees of the lasso estimator for the linear model.<n>We propose a meta-algorithm based on corralling that does not need knowledge of optimal sparsity parameter $k$ at minimal cost to regret.
arXiv Detail & Related papers (2024-10-26T01:42:03Z) - Selective Generation for Controllable Language Models [19.909671258499184]
Trustworthiness of generative language models (GLMs) is crucial in their deployment to critical decision making systems.<n>We propose two selective generation algorithms which control the false discovery rate with respect to the textual entailment relation (FDR-E)<n>$textttSGentextttSup$, a direct modification of the selective prediction, exploits entailment-labeled data, annotated by humans.<n>Since human annotation is costly, we propose a semi-supervised version, $textttSGentextttSemi$, which fully utilizes the un
arXiv Detail & Related papers (2023-07-18T13:36:24Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Model (LLM)
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34%$, $9.56%$, and $5.46%$ on the GSM8K, AQuA, and StrategyQA.
arXiv Detail & Related papers (2023-05-01T02:37:59Z) - Stochastic Contextual Dueling Bandits under Linear Stochastic
Transitivity Models [25.336599480692122]
We consider the regret minimization task in a dueling bandits problem with context information.
We propose a computationally efficient algorithm, $texttCoLSTIM$, which makes its choice based on imitating the feedback process.
Our experiments demonstrate its superiority over state-of-art algorithms for special cases of CoLST models.
arXiv Detail & Related papers (2022-02-09T17:44:19Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action
Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z) - Progressive Identification of True Labels for Partial-Label Learning [112.94467491335611]
Partial-label learning (PLL) is a typical weakly supervised learning problem, where each training instance is equipped with a set of candidate labels among which only one is the true label.
Most existing methods elaborately designed as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling up to big data.
This paper proposes a novel framework of classifier with flexibility on the model and optimization algorithm.
arXiv Detail & Related papers (2020-02-19T08:35:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.