Batch Prompting: Efficient Inference with Large Language Model APIs
- URL: http://arxiv.org/abs/2301.08721v2
- Date: Tue, 24 Oct 2023 07:58:35 GMT
- Title: Batch Prompting: Efficient Inference with Large Language Model APIs
- Authors: Zhoujun Cheng, Jungo Kasai, Tao Yu
- Abstract summary: Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly.
We propose batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches.
We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU.
- Score: 37.70875323133654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performing inference on large volumes of samples with large language models
(LLMs) can be computationally and financially costly in industry and real-world
use. We propose batch prompting, a simple yet effective prompting approach that
enables the LLM to run inference in batches, instead of one sample at a time.
Our method reduces both token and time costs while retaining downstream
performance. We theoretically demonstrate that under a few-shot in-context
learning setting, the inference costs decrease almost inverse linearly with the
number of samples in each batch. We extensively validate the effectiveness of
batch prompting on ten datasets across commonsense QA, arithmetic reasoning,
and NLI/NLU: batch prompting significantly (up to 5x with six samples in a batch)
reduces the LLM (Codex) inference token and time costs while achieving better
or comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5
and GPT-4, we show the benefits of batch prompting also hold. Further analysis
shows that the number of samples in each batch and the complexity of tasks
affect its performance. Moreover, batch prompting can be applied across
different reasoning methods using LLMs. Our code can be found at the site
https://github.com/xlang-ai/batch-prompting.
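The core idea can be sketched in a few lines. The snippet below is only an illustration, assuming an indexed Q[i]/A[i] prompt format and a hypothetical call_llm(prompt) wrapper; the paper's exact prompt template may differ. Because the few-shot demonstration tokens are written once and shared by all b samples in the batch, the per-sample prompt cost shrinks roughly in proportion to 1/b, which is the near inverse-linear reduction described in the abstract.

```python
# Minimal sketch of batch prompting (an illustration, not the authors' exact
# implementation). Few-shot demonstrations appear once at the top of the prompt,
# and b test questions are appended as an indexed batch, so the demonstration
# tokens are amortized over all b samples in a single LLM call.
import re

def build_batch_prompt(demos, batch):
    """demos: list of (question, answer) in-context examples.
    batch: list of b test questions answered together in one call."""
    demo_text = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    instruction = ("Answer every question below. "
                   "Reply with one line per question in the form A[i]: <answer>.")
    questions = "\n".join(f"Q[{i}]: {q}" for i, q in enumerate(batch, 1))
    return "\n\n".join([demo_text, instruction, questions])

def parse_batch_answers(completion, batch_size):
    """Map 'A[i]: ...' lines of the model's completion back to sample indices."""
    found = {int(m.group(1)): m.group(2).strip()
             for m in re.finditer(r"A\[(\d+)\]:\s*(.+)", completion)}
    return [found.get(i) for i in range(1, batch_size + 1)]

# Usage with a hypothetical call_llm(prompt) -> str wrapper around any LLM API:
#   prompt = build_batch_prompt(demos, questions[:6])
#   preds = parse_batch_answers(call_llm(prompt), batch_size=6)
```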
Related papers
- Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows [1.6163129903911508]
Whether fine-tuning Small Language Models (SLMs) remains the better choice for real-world applications may no longer be clear. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average.
arXiv Detail & Related papers (2025-05-30T03:59:35Z) - M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models [12.96619003056978]
This paper presents a novel multi-prompt ensemble decoding approach designed to bolster the generation quality of Large Language Models.
Given a unique input $X$, we submit $n$ variations of prompts with $X$ to LLMs in batch mode to decode and derive probability distributions.
For each token prediction, we calculate the ensemble probability by averaging the $n$ probability distributions within the batch, utilizing this aggregated probability to generate the token.
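As a rough formalization of this averaging step (notation mine, not taken from the paper): with $n$ prompt variants $P_1, \dots, P_n$ applied to the input $X$, the ensembled next-token distribution at decoding step $t$ is

$$ p_{\mathrm{ens}}(y_t \mid X, y_{<t}) = \frac{1}{n} \sum_{i=1}^{n} p\left(y_t \mid P_i(X),\, y_{<t}\right), $$

and the token $y_t$ is then generated from this averaged distribution (e.g., greedily).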
arXiv Detail & Related papers (2024-12-24T09:06:58Z) - Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention layers.
Our findings validate that a significant amount of computation can be avoided at inference time.
arXiv Detail & Related papers (2024-10-12T09:21:45Z) - Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting [0.8238423959893132]
"Auto-Demo Prompting" is a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference.
Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage.
arXiv Detail & Related papers (2024-10-02T16:34:40Z) - Efficient multi-prompt evaluation of LLMs [36.46258631685666]
We introduce PromptEval, a method for estimating performance across a large set of prompts.
We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically.
We show how PromptEval can be useful in LLM-as-a-judge and best prompt identification applications.
arXiv Detail & Related papers (2024-05-27T14:24:47Z) - Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers.
We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM
Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the LLM's own perception of response length to schedule sequences for batched inference.
arXiv Detail & Related papers (2023-05-22T15:36:06Z) - Instance-wise Prompt Tuning for Pretrained Language Models [72.74916121511662]
Instance-wise Prompt Tuning (IPT) is the first prompt learning paradigm that injects knowledge from the input data instances into the prompts.
IPT significantly outperforms task-based prompt learning methods, and achieves comparable performance to conventional finetuning with only 0.5% - 1.5% of tuned parameters.
arXiv Detail & Related papers (2022-06-04T10:08:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.