Related papers: LiveMind: Low-latency Large Language Models with Simultaneous Inference

URL: http://arxiv.org/abs/2406.14319v1
Date: Thu, 20 Jun 2024 13:52:30 GMT
Title: LiveMind: Low-latency Large Language Models with Simultaneous Inference
Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li
Abstract summary: We introduce a novel low-latency inference framework for large language models (LLMs). By reallocating computational processes to the prompt input phase, we achieve a substantial reduction in latency. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
Score: 9.795240210326346
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) that enables them to perform inference with incomplete prompts. By reallocating computational processes to the prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that use complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
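The idea of inferring from an incomplete prompt, or waiting for more of it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `StreamingSession`, `step_fn`, `answer_fn`, `INFER` and `WAIT` are hypothetical, and the toy functions stand in for real LLM and SLM calls. It only shows how work can be moved into the prompt input phase so that less remains to be done once the prompt is complete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action labels. The abstract only says the model may
# "infer from incomplete prompts or await additional prompts".
INFER = "infer"
WAIT = "wait"


@dataclass
class StreamingSession:
    # step_fn stands in for an LLM call made while the prompt is still
    # arriving: given the sentences seen so far and the earlier notes,
    # it returns (action, note).
    step_fn: Callable[[List[str], List[str]], Tuple[str, str]]
    # answer_fn stands in for the output model (the LLM itself or an SLM).
    answer_fn: Callable[[List[str], List[str]], str]
    seen: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def feed(self, sentence: str) -> str:
        """Process one newly arrived sentence of the streaming prompt."""
        self.seen.append(sentence)
        action, note = self.step_fn(self.seen, self.notes)
        if action == INFER:
            # Keep the intermediate inference for the final answer.
            self.notes.append(note)
        return action

    def finish(self) -> str:
        """Produce the response once the prompt is complete."""
        return self.answer_fn(self.seen, self.notes)


def toy_step(seen: List[str], notes: List[str]) -> Tuple[str, str]:
    # Toy policy (an assumption): infer only when the newest sentence
    # contains a digit, otherwise wait for more input.
    last = seen[-1]
    if any(ch.isdigit() for ch in last):
        return INFER, f"fact: {last}"
    return WAIT, ""


def toy_answer(seen: List[str], notes: List[str]) -> str:
    return f"{len(notes)} notes from {len(seen)} sentences"


session = StreamingSession(toy_step, toy_answer)
prompt = [
    "Alice has 3 apples.",
    "She likes fruit.",
    "Bob gives her 2 more.",
    "How many apples now?",
]
actions = [session.feed(s) for s in prompt]
result = session.finish()
print(actions)
print(result)
```

In this sketch the final call only has to combine notes that were already computed during input, which is the intuition behind the reported latency reduction. Swapping `answer_fn` for a smaller model would correspond to the LLM-plus-SLM collaboration the abstract mentions.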

Related papers

Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
arXiv Detail & Related papers (2025-06-30T02:56:11Z)
Learning to Inference Adaptively for Multimodal Large Language Models [19.510735093226703]
We introduce AdaLLaVA, an adaptive inference framework that learns to reconfigure operations in an MLLM during inference. We conduct experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency tradeoffs at runtime.
arXiv Detail & Related papers (2025-03-13T21:39:38Z)
Prompt Inversion Attack against Collaborative Inference of Large Language Models [14.786666134508645]
We introduce the concept of prompt inversion attack (PIA), where a malicious participant intends to recover the input prompt through the activation transmitted by its previous participant. Our method achieves an 88.4% token accuracy on the Skytrax dataset with the Llama-65B model when inverting the maximum number of transformer layers.
arXiv Detail & Related papers (2025-03-12T03:20:03Z)
From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning [31.95005389919542]
Scaling data and model size has been proven effective for boosting the performance of large language models. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised finetuning paradigm. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT.
arXiv Detail & Related papers (2025-01-21T04:11:59Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives. Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance. In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection [8.683288452838136]
Large Language Models (LLMs) have shown promise for human-like conversations but are primarily pre-trained on text data. We propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over the text-only approach.
arXiv Detail & Related papers (2024-06-13T22:52:07Z)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z)
Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI [10.82017289243097]
Large Language Models (LLMs) are capable of reasoning over diverse input data modalities through pre-trained encoders. m-LLM improves the task accuracy by up to 4% compared to the best existing scheme.
arXiv Detail & Related papers (2023-12-13T04:08:59Z)
Adapting LLMs for Efficient, Personalized Information Retrieval: Methods and Implications [0.7832189413179361]
Large Language Models (LLMs) excel in comprehending and generating human-like text. This paper explores strategies for integrating Language Models (LLMs) with Information Retrieval (IR) systems.
arXiv Detail & Related papers (2023-11-21T02:01:01Z)
Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) that combines an LLM with a smaller task-oriented dialogue (TOD) model. This approach uses the LLM as an annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models. Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.