Related papers: Large Language Model Routing with Benchmark Datasets

Large Language Model Routing with Benchmark Datasets

URL: http://arxiv.org/abs/2309.15789v1
Date: Wed, 27 Sep 2023 17:08:40 GMT
Title: Large Language Model Routing with Benchmark Datasets
Authors: Tal Shnitzer, Anthony Ou, M\'irian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin
Abstract summary: No single model typically achieves the best accuracy in all tasks and use cases. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this selection. We show that this problem can be reduced to a collection of binary classification tasks.
Score: 40.42044096089315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, no single model typically achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM out of a collection of models for new tasks. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and we show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where we consistently improve performance upon using any single model for all tasks.

Related papers

Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing [64.38277118982698]
Large Language Models (LLMs) have demonstrated human-like instruction-following abilities. In this work, we explore how to route the best-performing LLM for each instruction to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance.
arXiv Detail & Related papers (2025-02-24T16:10:53Z)
Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning [29.782832197148487]
Large language models (LLMs) have been widely used for text classification tasks. This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches. We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and datasets, comparing them to encoder-only RoBERTa models. We explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets.
arXiv Detail & Related papers (2024-12-11T18:06:44Z)
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic Large Language Models scaling guideline. Our work starts with a benchmarking of existing LLM scaling techniques, especially selective merging, and variants of mixture. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters through a model mixture.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z)
A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets [0.6144680854063939]
We introduce a benchmark aimed at better characterizing types of datasets where Deep Learning models excel. We evaluate 111 datasets with 20 different models, including both regression and classification tasks. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy.
arXiv Detail & Related papers (2024-08-27T06:58:52Z)
Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs) We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence. Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts. We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning. We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
Enhancing Subtask Performance of Multi-modal Large Language Model [12.033301861738952]
Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. This study selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches. The results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask.
arXiv Detail & Related papers (2023-08-31T05:37:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.