Batch Prompting: Efficient Inference with Large Language Model APIs
- URL: http://arxiv.org/abs/2301.08721v2
- Date: Tue, 24 Oct 2023 07:58:35 GMT
- Title: Batch Prompting: Efficient Inference with Large Language Model APIs
- Authors: Zhoujun Cheng, Jungo Kasai, Tao Yu
- Abstract summary: Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly.
We propose batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches.
We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU.
- Score: 37.70875323133654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performing inference on large volumes of samples with large language models
(LLMs) can be computationally and financially costly in industry and real-world
use. We propose batch prompting, a simple yet effective prompting approach that
enables the LLM to run inference in batches, instead of one sample at a time.
Our method reduces both token and time costs while retaining downstream
performance. We theoretically demonstrate that under a few-shot in-context
learning setting, the inference costs decrease almost inverse linearly with the
number of samples in each batch. We extensively validate the effectiveness of
batch prompting on ten datasets across commonsense QA, arithmetic reasoning,
and NLI/NLU: batch prompting significantly reduces (by up to 5x with six
samples per batch) the LLM (Codex) inference token and time costs while
achieving better or comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5
and GPT-4, we show the benefits of batch prompting also hold. Further analysis
shows that the number of samples in each batch and the complexity of tasks
affect its performance. Moreover, batch prompting can be applied across
different reasoning methods using LLMs. Our code is available at
https://github.com/xlang-ai/batch-prompting.
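As a concrete illustration of the batching idea above, here is a minimal Python sketch. The Q[i]/A[i] index-tag format and the answer-parsing logic are assumptions for illustration, not the paper's exact prompt templates.

```python
def build_batch_prompt(exemplars, samples):
    """Group few-shot exemplars and k test samples into one prompt.

    Questions and answers are tagged Q[1]..Q[k] / A[1]..A[k] so the model
    can emit one answer line per sample in a single completion.
    (Illustrative format, not the paper's exact template.)
    """
    lines = []
    # Few-shot demonstrations, also written in batched form.
    for i, (question, _) in enumerate(exemplars, start=1):
        lines.append(f"Q[{i}]: {question}")
    for i, (_, answer) in enumerate(exemplars, start=1):
        lines.append(f"A[{i}]: {answer}")
    # The actual batch of test samples to answer.
    for i, question in enumerate(samples, start=1):
        lines.append(f"Q[{i}]: {question}")
    return "\n".join(lines)


def parse_batch_response(text, k):
    """Recover per-sample answers A[1]..A[k] from one completion."""
    answers = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("A[") and "]:" in line:
            idx = int(line[2:line.index("]")])
            answers[idx] = line.split("]:", 1)[1].strip()
    return [answers.get(i, "") for i in range(1, k + 1)]
```

A single API call on the batched prompt then returns k answers at once, amortizing the few-shot context over the batch; this amortization is where the near inverse-linear per-sample cost reduction comes from.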
Related papers
- Efficient multi-prompt evaluation of LLMs [36.46258631685666]
We introduce PromptEval, a method for estimating performance across a large set of prompts.
We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically.
arXiv Detail & Related papers (2024-05-27T14:24:47Z)
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving [8.706905652975554]
Many parts of prompts are repetitive across requests, and their attention results can be reused.
This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes prompt sharing.
Preble outperforms the state of the art on average latency by 1.5x to 14.5x and on p99 latency by 2x to 10x.
arXiv Detail & Related papers (2024-05-08T06:30:58Z)
- Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers.
We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z)
- PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
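The offloading idea in the PAL blurb above can be sketched in a few lines of Python. The `solution()` function convention and the stand-in model output below are assumptions for illustration; real PAL prompts an LLM for the program, and executing untrusted generated code would need sandboxing.

```python
def run_program_aided(generated_code: str):
    """Execute model-generated Python and call its `solution()` function.

    Sketch only: `generated_code` stands in for an LLM completion, and the
    `solution()` entry-point convention is an assumption for illustration.
    """
    scope = {}
    exec(generated_code, scope)  # offload the solution step to the interpreter
    return scope["solution"]()


# A stand-in for what the LLM might generate for a simple word problem.
fake_llm_output = (
    "def solution():\n"
    "    apples = 23\n"
    "    eaten = 20\n"
    "    bought = 6\n"
    "    return apples - eaten + bought\n"
)
```

The language model handles reading and decomposing the problem; the interpreter guarantees the arithmetic is carried out exactly.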
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Instance-wise Prompt Tuning for Pretrained Language Models [72.74916121511662]
Instance-wise Prompt Tuning (IPT) is the first prompt learning paradigm that injects knowledge from the input data instances to the prompts.
IPT significantly outperforms task-based prompt learning methods and achieves performance comparable to conventional fine-tuning while tuning only 0.5%-1.5% of the parameters.
arXiv Detail & Related papers (2022-06-04T10:08:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.