Cache me if you Can: an Online Cost-aware Teacher-Student framework to
Reduce the Calls to Large Language Models
- URL: http://arxiv.org/abs/2310.13395v1
- Date: Fri, 20 Oct 2023 10:05:07 GMT
- Title: Cache me if you Can: an Online Cost-aware Teacher-Student framework to
Reduce the Calls to Large Language Models
- Authors: Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion
Androutsopoulos
- Abstract summary: Small and medium-sized enterprises (SMEs) cannot afford the cost of creating large task-specific training datasets.
Third-party services that allow them to prompt Large Language Models currently require a payment per call.
We propose a framework that reduces the number of calls to LLMs by caching previous responses and using them to train a local, inexpensive model.
- Score: 13.799197575126442
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Prompting Large Language Models (LLMs) performs impressively in zero- and
few-shot settings. Hence, small and medium-sized enterprises (SMEs) that can
afford neither the cost of creating large task-specific training datasets nor
the cost of pretraining their own LLMs are increasingly turning to third-party
services that allow them to prompt LLMs. However, such services currently
require a payment per call, which becomes a significant operating expense
(OpEx). Furthermore, customer inputs are often very similar over time, so SMEs
end up prompting LLMs with very similar instances. We propose a framework that
reduces the number of calls to LLMs by caching previous LLM responses and using
them to train a local, inexpensive model on the SME side. The framework
includes criteria for deciding when to trust the local model or call the LLM,
and a methodology to tune the criteria and measure the tradeoff between
performance and cost. For experimental purposes, we instantiate our framework
with two LLMs, GPT-3.5 or GPT-4, and two inexpensive students, a k-NN
classifier or a Multi-Layer Perceptron, using two common business tasks, intent
recognition and sentiment analysis. Experimental results indicate that
significant OpEx savings can be obtained with only slightly lower performance.
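The decision loop at the heart of the framework can be pictured with a short sketch. The Python below is only an illustrative rendering of the cache-and-student idea, assuming a hypothetical embed() sentence encoder and a paid call_llm() API, and using a fixed confidence threshold where the paper tunes its criteria on held-out data:

```python
# Minimal sketch of the cache-and-student idea. embed() and call_llm() are
# hypothetical stand-ins for a sentence encoder and the paid teacher-LLM API;
# the confidence threshold is illustrative, not the paper's tuned criterion.
from sklearn.neighbors import KNeighborsClassifier

class CachedClassifier:
    def __init__(self, embed, call_llm, conf_threshold=0.8, k=5):
        self.embed, self.call_llm = embed, call_llm
        self.conf_threshold, self.k = conf_threshold, k
        self.X, self.y = [], []      # cache of (embedding, LLM response) pairs
        self.student = None          # local inexpensive model (a k-NN classifier here)

    def predict(self, text):
        x = self.embed(text)
        if self.student is not None:
            proba = self.student.predict_proba([x])[0]
            if proba.max() >= self.conf_threshold:
                return self.student.classes_[proba.argmax()]  # trust the student: no paid call
        label = self.call_llm(text)                           # otherwise call the teacher LLM
        self.X.append(x)
        self.y.append(label)                                  # cache the LLM's response
        if len(set(self.y)) > 1:                              # refit once at least two labels are cached
            self.student = KNeighborsClassifier(n_neighbors=min(self.k, len(self.X)))
            self.student.fit(self.X, self.y)
        return label
```

Each avoided call is an OpEx saving, while each accepted student prediction risks a small accuracy drop; measuring that tradeoff is exactly what the paper's tuning methodology is for.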
Related papers
- A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in developing large language models (LLMs) is their onerous pre-training cost.
This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z)
- MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs [21.689490112983677]
We introduce MetaLLM, a framework that dynamically routes each query to the optimal large language model (LLM) for classification tasks.
By framing the selection problem as a multi-armed bandit, MetaLLM balances prediction accuracy and cost efficiency under uncertainty.
Our experiments, conducted on popular LLM platforms, showcase MetaLLM's efficacy in real-world scenarios.
arXiv Detail & Related papers (2024-07-15T15:45:07Z)
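The MetaLLM entry above frames model selection as a multi-armed bandit. As a toy illustration only (not MetaLLM's actual algorithm), an epsilon-greedy bandit over a list of hypothetical model callables could trade answer quality against per-call cost like this:

```python
import random

def route_queries(queries, models, costs, reward_fn, eps=0.1):
    """Epsilon-greedy routing sketch: `models` are hypothetical LLM callables,
    `costs` their per-call prices, and reward_fn(answer, query) a quality score
    in [0, 1]; the reward subtracts cost so the bandit is cost-aware."""
    counts = [0] * len(models)
    values = [0.0] * len(models)
    for q in queries:
        if random.random() < eps:                 # explore a random model
            arm = random.randrange(len(models))
        else:                                     # exploit the best-looking model so far
            arm = max(range(len(models)), key=lambda i: values[i])
        answer = models[arm](q)
        reward = reward_fn(answer, q) - costs[arm]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # running-mean reward estimate
        yield answer
```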
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) for diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
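For intuition, the base-plus-delta decomposition summarized above can be illustrated with a plain truncated SVD of the fine-tuning delta; this is a generic low-rank compression sketch, not Delta-CoMe's mixed-precision method:

```python
import numpy as np

def compress_delta(w_finetuned, w_base, rank=8):
    """Keep a rank-`rank` approximation of the fine-tuning delta (illustrative only)."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]    # store these factors instead of the full delta

def reconstruct(w_base, u, s, vt):
    return w_base + (u * s) @ vt                  # approximate fine-tuned weight matrix
```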
- SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees [21.801053526411415]
Large Language Models (LLMs) have significantly boosted performance in natural language processing (NLP) tasks.
The deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance.
We introduce SMART, a novel framework designed to minimize the inference costs of NLP tasks while ensuring sufficient result quality.
arXiv Detail & Related papers (2024-03-11T17:45:47Z)
- Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the agent's capabilities into a planner, a caller, and a summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
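The planner/caller/summarizer decomposition above can be read as a simple three-stage loop. The sketch below is schematic, with hypothetical callables (each of which could be a different, smaller LLM) and an assumed step format:

```python
def run_agent(task, planner, caller, summarizer, tools, max_steps=10):
    """Schematic multi-LLM agent loop: the planner chooses the next step,
    the caller invokes the chosen tool, the summarizer composes the answer.
    planner/caller/summarizer and the step dict format are assumptions."""
    observations = []
    for _ in range(max_steps):
        step = planner(task, observations)    # e.g. {"tool": "search", "input": "...", "done": False}
        if step.get("done"):
            break
        result = caller(tools[step["tool"]], step["input"])
        observations.append((step, result))
    return summarizer(task, observations)
```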
- Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning [19.472937476936636]
Large language models (LLMs) have exhibited remarkable performance in a variety of tasks, but this strong performance often comes with the high expense of using paid API services.
In this paper, we study how to build an LLM cascade to reduce the cost of using LLMs.
Our proposed cascades can achieve performance comparable to using solely the stronger LLM but require only 40% of its cost.
arXiv Detail & Related papers (2023-10-04T18:21:17Z)
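A cascade in the sense of the entry above can be sketched as: sample the weaker, cheaper LLM several times and escalate to the stronger one only when its answers disagree. The majority-vote test below is a simplification, not the paper's mixture-of-thoughts consistency criterion:

```python
from collections import Counter

def cascade_answer(question, weak_llm, strong_llm, n_samples=5, agreement=0.8):
    """weak_llm and strong_llm are hypothetical callables returning short answers."""
    answers = [weak_llm(question) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    if votes / n_samples >= agreement:
        return best                     # cheap path: the weak model is consistent enough
    return strong_llm(question)         # expensive path: defer to the stronger LLM
```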
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
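As a toy contrast with unstructured (per-weight) pruning, structural pruning removes whole units such as attention heads. The sketch below scores heads by a simple L2 weight norm; this is illustrative only and not LLM-Pruner's coupled-structure importance criterion:

```python
import numpy as np

def prune_heads(w_heads, keep_ratio=0.75):
    """w_heads: array of shape (n_heads, head_dim, d_model) for one attention layer.
    Drops whole heads (a structural unit) with the smallest L2 weight norm."""
    n_heads = w_heads.shape[0]
    scores = np.linalg.norm(w_heads.reshape(n_heads, -1), axis=1)  # one importance score per head
    n_keep = max(1, int(round(n_heads * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:])                   # indices of the heads to keep
    return w_heads[keep], keep
```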
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance [36.94826820536239]
We review the cost associated with querying popular large language models (LLMs).
We discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs.
Experiments show that FrugalGPT can match the performance of the best individual LLM with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost.
arXiv Detail & Related papers (2023-05-09T05:11:02Z)