Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
- URL: http://arxiv.org/abs/2305.13144v2
- Date: Sun, 28 May 2023 08:22:19 GMT
- Title: Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
- Authors: Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You
- Abstract summary: Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
- Score: 22.08897444328099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have revolutionized the field of AI,
demonstrating unprecedented capacity across various tasks. However, the
inference process for LLMs comes with significant computational costs. In this
paper, we propose an efficient LLM inference pipeline that harnesses the power
of LLMs. Our approach begins by tapping into the potential of LLMs to
accurately perceive and predict the response length with minimal overhead. By
leveraging this information, we introduce an efficient sequence scheduling
technique that groups queries with similar response lengths into micro-batches.
We evaluate our approach on real-world instruction datasets using the
LLaMA-based model, and our results demonstrate an impressive 86% improvement in
inference throughput without compromising effectiveness. Notably, our method is
orthogonal to other inference acceleration techniques, making it a valuable
addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM
inference.
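The abstract describes two coupled ideas: prompting the model itself to estimate how long its response will be, and grouping queries with similar predicted lengths into micro-batches so that decoding is not dominated by the longest sequence in the batch. Below is a minimal sketch of that pipeline; the prompt wording, the `perceive_lengths` and `schedule_micro_batches` helpers, and the fixed batch size are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of response length perception + sequence scheduling.
# The prompt text, predictor interface, and batch size are assumptions
# for illustration; they are not taken from the paper's code.
from typing import Callable, List

LENGTH_PROMPT = (
    "Estimate how many tokens your answer to the following instruction "
    "would contain. Reply with a single integer.\n\nInstruction: {query}"
)

def perceive_lengths(queries: List[str],
                     llm: Callable[[str], str]) -> List[int]:
    """Ask the serving LLM itself to estimate each response's length."""
    predictions = []
    for q in queries:
        raw = llm(LENGTH_PROMPT.format(query=q))
        digits = "".join(ch for ch in raw if ch.isdigit())
        predictions.append(int(digits) if digits else 0)
    return predictions

def schedule_micro_batches(queries: List[str],
                           predicted_lengths: List[int],
                           batch_size: int = 8) -> List[List[str]]:
    """Group queries with similar predicted lengths into micro-batches so
    short responses are not padded out to the longest sequence in the batch."""
    order = sorted(range(len(queries)), key=lambda i: predicted_lengths[i])
    return [[queries[i] for i in order[s:s + batch_size]]
            for s in range(0, len(order), batch_size)]

if __name__ == "__main__":
    fake_llm = lambda prompt: "about 120 tokens"   # stand-in for a real LLM call
    qs = [f"instruction {i}" for i in range(20)]
    batches = schedule_micro_batches(qs, perceive_lengths(qs, fake_llm))
    print([len(b) for b in batches])   # -> [8, 8, 4]
```

Sorting by predicted length before batching is what recovers the wasted computation: within each micro-batch, sequences finish at roughly the same decoding step, so little work is spent generating past already-finished (padded) responses.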
Related papers
- A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis [26.505386645322506]
Large Language Models (LLMs) have garnered increasing attention in the field of natural language processing.
In this paper, we present a comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs.
Our experiments demonstrate that LLMs achieve new state-of-the-art performance compared to fine-tuned Small Language Models (SLMs) in the fine-tuning-dependent paradigm.
arXiv Detail & Related papers (2024-12-03T08:54:17Z)
- LLMs can Schedule [3.435169201271934]
The job shop scheduling problem (JSSP) remains a significant hurdle in optimizing production processes.
This paper explores the potential of Large Language Models (LLMs) for JSSP.
Surprisingly, our findings demonstrate that LLM-based scheduling can achieve performance comparable to other neural approaches.
arXiv Detail & Related papers (2024-08-13T15:53:58Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
However, LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM)-empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide a theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and significantly decreases the number of agent interaction steps.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- Improve Temporal Awareness of LLMs for Sequential Recommendation [61.723928508200196]
Large language models (LLMs) have demonstrated impressive zero-shot abilities in solving a wide range of general-purpose tasks.
However, LLMs fall short in recognizing and utilizing temporal information, resulting in poor performance on tasks that require an understanding of sequential data.
We propose three prompting strategies to exploit temporal information within historical interactions for LLM-based sequential recommendation.
arXiv Detail & Related papers (2024-05-05T00:21:26Z)
- Online Cascade Learning for Efficient Inference over Streams [9.516197133796437]
Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of applying them poses a challenge for such workloads.
We propose online cascade learning, the first approach to address this challenge.
We formulate the task of learning cascades online as an imitation-learning problem.
arXiv Detail & Related papers (2024-02-07T01:46:50Z)
- Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward [29.81212051279456]
Recent advancements in model compression and system-level optimization methods aim to enhance LLM inference.
This survey offers an overview of these methods, emphasizing recent developments.
arXiv Detail & Related papers (2024-02-02T06:29:34Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.