Which LLM to Play? Convergence-Aware Online Model Selection with
Time-Increasing Bandits
- URL: http://arxiv.org/abs/2403.07213v1
- Date: Mon, 11 Mar 2024 23:52:46 GMT
- Title: Which LLM to Play? Convergence-Aware Online Model Selection with
Time-Increasing Bandits
- Authors: Yu Xia, Fang Kong, Tong Yu, Liya Guo, Ryan A. Rossi, Sungchul Kim,
Shuai Li
- Abstract summary: We propose a time-increasing bandit algorithm TI-UCB, which effectively predicts the increase of model performances.
Our results highlight the importance of utilizing the increasing-then-converging pattern for more efficient and economical model selection.
- Score: 43.65904435249823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-based applications such as chatbots, search engines and news
recommendations continue to grow in scale and complexity with the recent surge
in the adoption of LLMs. Online model selection has thus garnered increasing
attention due to the need to choose the best model among a diverse set while
balancing task reward and exploration cost. Organizations face decisions such as
whether to employ a costly API-based LLM or a locally finetuned small LLM,
weighing cost against performance. Traditional selection methods often evaluate
every candidate model before choosing one, which is becoming impractical given
the rising costs of training and finetuning LLMs. Moreover, it is undesirable
to allocate excessive resources towards exploring poor-performing models. While
some recent works leverage online bandit algorithms to manage the
exploration-exploitation trade-off in model selection, they tend to overlook
the increasing-then-converging trend in model performances as the model is
iteratively finetuned, leading to less accurate predictions and suboptimal
model selections.
In this paper, we propose a time-increasing bandit algorithm TI-UCB, which
effectively predicts the increase of model performances due to finetuning and
efficiently balances exploration and exploitation in model selection. To
further capture the converging points of models, we develop a change detection
mechanism by comparing consecutive increase predictions. We theoretically prove
that our algorithm achieves a logarithmic regret upper bound in a typical
increasing bandit setting, which implies a fast convergence rate. The advantage
of our method is also empirically validated through extensive experiments on
classification model selection and online selection of LLMs. Our results
highlight the importance of utilizing the increasing-then-converging pattern for
more efficient and economical model selection in the deployment of LLMs.
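To make the described approach concrete, below is a minimal, hypothetical sketch of a TI-UCB-style selection loop. It is not the authors' exact algorithm: the class name `TIUCBSketch`, the linear trend fit, the `window` size, and the slope-gap change test are illustrative assumptions standing in for the paper's increase prediction and change detection mechanism.

```python
# Hypothetical sketch of a TI-UCB-style selection loop (illustrative, not the
# authors' exact algorithm). Each arm is a candidate model whose reward is
# assumed to increase with its pull count as finetuning progresses; we
# extrapolate a recent linear trend, add a UCB-style bonus, and trim an arm's
# history when consecutive slope estimates drop enough to suggest convergence.
import math
import random


class TIUCBSketch:
    def __init__(self, n_arms, window=10, change_threshold=0.05, c=2.0):
        self.n_arms = n_arms
        self.window = window                      # points used for the trend fit
        self.change_threshold = change_threshold  # slope drop signalling convergence
        self.c = c                                # exploration strength
        self.rewards = [[] for _ in range(n_arms)]
        self.prev_slope = [0.0] * n_arms

    def _slope(self, ys):
        """Ordinary least-squares slope of reward against pull index."""
        n = len(ys)
        if n < 2:
            return 0.0
        mx, my = (n - 1) / 2.0, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in enumerate(ys))
        var = sum((x - mx) ** 2 for x in range(n))
        return cov / var if var else 0.0

    def select(self, t):
        # Pull every arm once before trusting the estimates.
        for a in range(self.n_arms):
            if not self.rewards[a]:
                return a
        best, best_score = 0, float("-inf")
        for a in range(self.n_arms):
            hist = self.rewards[a][-self.window:]
            # Extrapolate the increasing trend half a window beyond its mean.
            predicted = sum(hist) / len(hist) + self._slope(hist) * len(hist) / 2.0
            bonus = math.sqrt(self.c * math.log(t + 1) / len(self.rewards[a]))
            if predicted + bonus > best_score:
                best, best_score = a, predicted + bonus
        return best

    def update(self, arm, reward):
        self.rewards[arm].append(reward)
        slope = self._slope(self.rewards[arm][-self.window:])
        # Crude change detection: if the slope falls noticeably between
        # consecutive estimates, assume the curve is converging and keep
        # only the recent observations for future predictions.
        if self.prev_slope[arm] - slope > self.change_threshold:
            self.rewards[arm] = self.rewards[arm][-self.window:]
        self.prev_slope[arm] = slope


# Toy usage: two "models" whose rewards rise and then plateau at different levels.
def pull(arm, n_pulls):
    plateau = [0.6, 0.8][arm]
    return plateau * (1 - math.exp(-0.1 * n_pulls)) + random.gauss(0.0, 0.02)


policy = TIUCBSketch(n_arms=2)
counts = [0, 0]
for t in range(200):
    a = policy.select(t)
    counts[a] += 1
    policy.update(a, pull(a, counts[a]))
print(counts)  # the arm with the higher plateau should dominate
```

In this sketch, trimming an arm's history after a detected slope drop plays the role the paper assigns to change detection: estimates then track the converged reward level rather than the transient ramp-up.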
Related papers
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced cost.
We propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups (a minimal sketch of this routing idea appears after this list).
arXiv Detail & Related papers (2024-05-03T14:38:59Z) - A Two-Phase Recall-and-Select Framework for Fast Model Selection [13.385915962994806]
We propose a two-phase (coarse-recall and fine-selection) model selection framework.
It aims to enhance the efficiency of selecting a robust model by leveraging the models' training performances on benchmark datasets.
It has been demonstrated that the proposed methodology facilitates the selection of a high-performing model about 3x faster than conventional baseline methods.
arXiv Detail & Related papers (2024-03-28T14:44:44Z) - Modeling Choice via Self-Attention [8.394221523847325]
We show that our attention-based choice model is a low-rank generalization of the Halo Multinomial Logit (Halo-MNL) model.
We also establish the first realistic-scale benchmark for choice estimation on real data, conducting an evaluation of existing models.
arXiv Detail & Related papers (2023-11-11T11:13:07Z) - Anytime Model Selection in Linear Bandits [61.97047189786905]
We develop ALEXP, which has an exponentially improved dependence on $M$ for its regret.
Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
arXiv Detail & Related papers (2023-07-24T15:44:30Z) - When to Update Your Model: Constrained Model-based Reinforcement
Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z) - Bellman: A Toolbox for Model-Based Reinforcement Learning in TensorFlow [14.422129911404472]
Bellman aims to fill this gap and introduces the first thoroughly designed and tested model-based RL toolbox.
Our modular approach enables combining a wide range of environment models with generic model-based agent classes that recover state-of-the-art algorithms.
arXiv Detail & Related papers (2021-03-26T11:32:27Z)