Large Language Model Routing with Benchmark Datasets
- URL: http://arxiv.org/abs/2309.15789v1
- Date: Wed, 27 Sep 2023 17:08:40 GMT
- Title: Large Language Model Routing with Benchmark Datasets
- Authors: Tal Shnitzer, Anthony Ou, M\'irian Silva, Kate Soule, Yuekai Sun,
Justin Solomon, Neil Thompson, Mikhail Yurochkin
- Abstract summary: No single model typically achieves the best accuracy in all tasks and use cases.
We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this selection.
We show that this problem can be reduced to a collection of binary classification tasks.
- Score: 40.42044096089315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a rapidly growing number of open-source Large Language Models (LLMs)
and benchmark datasets to compare them. While some models dominate these
benchmarks, no single model typically achieves the best accuracy in all tasks
and use cases. In this work, we address the challenge of selecting the best LLM
out of a collection of models for new tasks. We propose a new formulation for
the problem, in which benchmark datasets are repurposed to learn a "router"
model for this LLM selection, and we show that this problem can be reduced to a
collection of binary classification tasks. We demonstrate the utility and
limitations of learning model routers from various benchmark datasets, where we
consistently improve performance upon using any single model for all tasks.
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs)
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - GistScore: Learning Better Representations for In-Context Example
Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z) - Entity Matching using Large Language Models [4.189643331553922]
This paper investigates using generative large language models (LLMs) as a less task-specific training data-dependent alternative to PLM-based matchers.
We show that GPT4 can generate structured explanations for matching decisions.
arXiv Detail & Related papers (2023-10-17T13:12:32Z) - Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z) - Enhancing Subtask Performance of Multi-modal Large Language Model [12.033301861738952]
Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data.
This study selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches.
The results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask.
arXiv Detail & Related papers (2023-08-31T05:37:21Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - Event Classification with Multi-step Machine Learning [0.0]
Multi-step Machine Learning (ML) is organized into connected sub-tasks with known intermediate inference goals.
Differentiable Architecture Search (DARTS) and Single Path One-Shot NAS (SPOS-NAS) are tested, where the construction of loss functions is improved to keep all ML models smoothly learning.
Using DARTS and SPOS-NAS as an optimization and selection as well as the connections for multi-step machine learning systems, we find that (1) such a system can quickly and successfully select highly performant model combinations, and (2) the selected models are consistent with baseline algorithms, such as grid search
arXiv Detail & Related papers (2021-06-04T07:22:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.