CliqueParcel: An Approach For Batching LLM Prompts That Jointly
Optimizes Efficiency And Faithfulness
- URL: http://arxiv.org/abs/2402.14833v1
- Date: Sat, 17 Feb 2024 22:37:17 GMT
- Title: CliqueParcel: An Approach For Batching LLM Prompts That Jointly
Optimizes Efficiency And Faithfulness
- Authors: Jiayi Liu, Tinghan Yang, Jennifer Neville
- Abstract summary: CliqueParcel is designed to improve the efficiency of large language models (LLMs) during inference.
CliqueParcel is tested on eight widely recognized datasets.
This work provides novel insights into inference efficiency and demonstrates promising performance.
- Score: 13.554160815699435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have become pivotal in recent research. However,
during the inference process, LLMs still require substantial resources. In this
paper, we propose CliqueParcel, a method designed to improve the efficiency of
LLMs via prompt batching. Existing strategies to optimize inference efficiency
often compromise on output quality, leading to a discounted output problem.
This issue might result in reduced accuracy or outputs that are less detailed.
CliqueParcel is our answer to this challenge. While ensuring accuracy and
minimizing deviations from the original outputs (i.e., faithfulness), our
method significantly improves efficiency during inference.
To lay the groundwork, we first redefine efficiency measurements to exclude
running-time reductions that come merely from shorter outputs. Then, we provide a
comprehensive trade-off between efficiency and faithfulness to clarify the
nature of the 'discounted output' problem. Within the CliqueParcel framework,
we suggest multiple batching sub-methods and discuss the specific scenarios in
which they can be applied. During evaluation, CliqueParcel is tested on eight
widely recognized datasets, which can be classified into three types: reading
comprehension, open-source question-answering, and reasoning. Our experiments
explore the performance of CliqueParcel, including efficiency, faithfulness,
and the trade-off between them. This work provides novel insights into
inference efficiency and demonstrates promising performance.
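To make the core mechanism concrete, the sketch below illustrates prompt batching (pack several prompts into one request, then split the single response back into per-prompt answers) together with one plausible length-normalized efficiency measure of the kind the abstract alludes to. Everything here (the delimiter convention, `batch_prompts`, `per_token_speedup`) is an illustrative assumption, not CliqueParcel's actual implementation or its exact efficiency definition.

```python
# Hedged sketch of prompt batching; not CliqueParcel's code.
from typing import List

def batch_prompts(prompts: List[str]) -> str:
    """Pack several prompts into one request with delimited answers."""
    numbered = "\n".join(f"Q{i + 1}: {p}" for i, p in enumerate(prompts))
    return ("Answer every question. Start answer j with the exact line "
            "'### ANSWER j ###'.\n" + numbered)

def split_batched_response(response: str, n: int) -> List[str]:
    """Recover per-prompt answers from the delimited batched response."""
    answers = []
    for i in range(1, n + 1):
        marker = f"### ANSWER {i} ###"
        start = response.find(marker)
        if start == -1:
            answers.append("")  # missing answer: a faithfulness failure
            continue
        start += len(marker)
        end = response.find(f"### ANSWER {i + 1} ###")
        answers.append(response[start:end if end != -1 else None].strip())
    return answers

def per_token_speedup(t_separate: float, tokens_separate: int,
                      t_batched: float, tokens_batched: int) -> float:
    """One plausible length-normalized efficiency measure: compare
    per-token latency, so a speedup that comes only from shorter
    (discounted) outputs is not rewarded. Not the paper's exact formula."""
    return (t_separate / tokens_separate) / (t_batched / tokens_batched)
```

Faithfulness is then a matter of how closely each recovered answer matches what the same prompt would have produced on its own; the framework's batching sub-methods differ in how prompts are grouped.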
Related papers
- Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving.
Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods.
We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z)
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- Q-PEFT: Query-dependent Parameter Efficient Fine-tuning for Text Reranking with Large Language Models [28.105271954633682]
We introduce a query-dependent parameter-efficient fine-tuning (Q-PEFT) approach for text reranking that leaks query information to Large Language Models (LLMs).
We utilize the query to extract the top-$k$ tokens from input documents, serving as contextual clues.
We further augment Q-PEFT by substituting the retrieval mechanism with a multi-head attention layer to achieve end-to-end training and cover all the tokens in the documents.
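As a rough illustration of the top-$k$ clue-extraction step above, the sketch below scores each document token against the query and keeps the k best matches. The names and the embedding-similarity scoring are illustrative assumptions, not the Q-PEFT implementation (which also has the attention-based variant just mentioned).

```python
# Hedged sketch of query-dependent top-k token extraction.
import numpy as np

def top_k_clue_tokens(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                      doc_tokens: list, k: int = 8) -> list:
    """Keep the k document tokens most similar to any query token
    (illustrative scoring, not the Q-PEFT implementation)."""
    sims = doc_vecs @ query_vecs.T      # (n_doc_tokens, n_query_tokens)
    scores = sims.max(axis=1)           # best query match per doc token
    top = np.argsort(-scores)[:k]
    return [doc_tokens[i] for i in sorted(top)]  # preserve document order
```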
arXiv Detail & Related papers (2024-04-06T06:44:41Z)
- Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data [36.09359953556684]
Large Language Models (LLMs) operating in 0-shot or few-shot settings achieve competitive results in Text Classification tasks.
In-Context Learning (ICL) typically achieves better accuracy than the 0-shot setting, but at a cost in efficiency due to the longer input prompt.
arXiv Detail & Related papers (2024-04-03T03:24:19Z)
- See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z)
- Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- Compressing Context to Enhance Inference Efficiency of Large Language Models [26.75216730927996]
This paper proposes a method called Selective Context to enhance the inference efficiency of large language models (LLMs).
We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations.
Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency.
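A minimal sketch of the pruning idea described above: rank units of the context by an approximate self-information score and keep only the most informative fraction. Here a unigram-frequency proxy stands in for the small causal LM the actual method uses; the names and keep ratio are illustrative assumptions.

```python
# Hedged sketch of selective context pruning.
import math
from collections import Counter

def prune_context(words: list, keep_ratio: float = 0.7) -> list:
    """Drop the least informative words, using unigram self-information
    (-log p) as a stand-in for a small causal LM's token probabilities."""
    counts = Counter(words)
    total = sum(counts.values())
    info = {w: -math.log(c / total) for w, c in counts.items()}
    n_keep = max(1, int(len(words) * keep_ratio))
    keep = set(sorted(range(len(words)), key=lambda i: info[words[i]],
                      reverse=True)[:n_keep])
    return [w for i, w in enumerate(words) if i in keep]  # keep order
```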
arXiv Detail & Related papers (2023-10-09T23:03:24Z)
- Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
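For concreteness, a multi-input zero-shot classification prompt of the kind such batching produces might look like the following; the template and example texts are illustrative assumptions, not the paper's prompt format.

```python
# Illustrative multi-input zero-shot classification prompt.
texts = ["The battery dies within an hour.", "Fast shipping, great price!"]
prompt = ("Classify the sentiment of each review as positive or negative.\n"
          + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
          + "\nAnswer with one label per line, in order.")
```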
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that uses the LLM itself to perceive response lengths and schedule sequences for batching.
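A hedged sketch of that scheduling idea: predict each query's response length, then batch queries of similar predicted length so short answers are not held back by long ones. `predict_len` stands in for the paper's LLM-based length perception; all names are assumptions.

```python
# Hedged sketch of length-aware sequence scheduling.
from typing import Callable, List

def schedule_batches(queries: List[str],
                     predict_len: Callable[[str], int],
                     batch_size: int = 8) -> List[List[str]]:
    """Group queries by predicted response length so a batch is not
    dominated by its longest member."""
    ranked = sorted(queries, key=predict_len)  # short to long
    return [ranked[i:i + batch_size]
            for i in range(0, len(ranked), batch_size)]
```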
arXiv Detail & Related papers (2023-05-22T15:36:06Z)