DroidSpeak: Enhancing Cross-LLM Communication
- URL: http://arxiv.org/abs/2411.02820v1
- Date: Tue, 05 Nov 2024 05:41:41 GMT
- Title: DroidSpeak: Enhancing Cross-LLM Communication
- Authors: Yuhan Liu, Esha Choukse, Shan Lu, Junchen Jiang, Madan Musuvathi
- Abstract summary: We introduce DroidSpeak, a novel framework targeting cross-LLM communication.
We efficiently bypass the need to reprocess entire contexts for fine-tuned versions of the same foundational model.
Our findings underscore the potential to create more efficient and scalable multi-agent systems.
- Score: 15.901409892276288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multi-agent systems utilizing Large Language Models (LLMs), communication between agents traditionally relies on natural language. This communication often includes the full context of the query so far, which can introduce significant prefill-phase latency, especially with long contexts. We introduce DroidSpeak, a novel framework to target this cross-LLM communication by leveraging the reuse of intermediate data, such as input embeddings (E-cache) and key-value caches (KV-cache). We efficiently bypass the need to reprocess entire contexts for fine-tuned versions of the same foundational model. This approach allows faster context integration while maintaining the quality of task performance. Experimental evaluations demonstrate DroidSpeak's ability to significantly accelerate inter-agent communication, achieving up to a 2.78x speedup in prefill latency with negligible loss in accuracy. Our findings underscore the potential to create more efficient and scalable multi-agent systems.
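The mechanism is easy to picture in code. Below is a minimal sketch (not the authors' implementation) of the plumbing DroidSpeak optimizes: one agent pays the prefill cost for a shared context once and exports its KV cache, and a second agent derived from the same base model resumes from that cache instead of re-encoding the context. The model names are hypothetical placeholders, and whether a fine-tuned receiver can consume the sender's cache without quality loss is precisely what the paper evaluates.

```python
# Minimal sketch (not the authors' code) of KV-cache handoff between
# two agents derived from the same base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "org/base-7b"              # hypothetical foundation checkpoint
FINETUNE = "org/base-7b-agent-b"  # hypothetical fine-tune of BASE

tok = AutoTokenizer.from_pretrained(BASE)
sender = AutoModelForCausalLM.from_pretrained(BASE)
receiver = AutoModelForCausalLM.from_pretrained(FINETUNE)

context = "Long shared conversation history ..."
ctx_ids = tok(context, return_tensors="pt").input_ids

# Sender pays the prefill cost once and exports its KV cache.
with torch.no_grad():
    kv = sender(ctx_ids, use_cache=True).past_key_values

# Receiver resumes from the cache instead of re-encoding the context.
# Whether this preserves task quality across fine-tunes is what
# DroidSpeak studies; the call itself only shows the plumbing.
q_ids = tok(" What should we do next?", return_tensors="pt").input_ids
with torch.no_grad():
    out = receiver(q_ids, past_key_values=kv, use_cache=True)
print(tok.decode(out.logits[:, -1].argmax(-1)))
```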
Related papers
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.
We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.
We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
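The confidence-aware parallel decoding idea in the Fast-dLLM entry above can be sketched as follows. The model interface is assumed (a masked diffusion LM returning per-position logits), and the fallback rule is a simplification, not the paper's.

```python
# Sketch of confidence-aware parallel decoding for a masked diffusion
# LM (assumed interface: model(ids) -> logits of shape [B, seq, vocab]).
import torch

def parallel_decode_step(model, ids, mask_id, tau=0.9):
    """Unmask every masked position whose top-1 confidence exceeds tau."""
    probs = model(ids).softmax(-1)
    conf, pred = probs.max(-1)            # top-1 confidence and token
    masked = ids.eq(mask_id)
    accept = masked & (conf > tau)
    if not accept.any():
        # Always make progress: fall back to the single most
        # confident masked position.
        flat = torch.where(masked, conf, torch.full_like(conf, -1.0))
        accept = torch.zeros_like(masked).view(-1)
        accept[flat.view(-1).argmax()] = True
        accept = accept.view_as(masked)
    return torch.where(accept, pred, ids)
```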
- dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.
We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.
Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.
This paper formulates LLM inference optimization as a multi-stage online scheduling problem.
We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings [9.763273544617176]
Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning.
In this paper, we introduce a simple yet effective framework to address this challenge.
Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more.
arXiv Detail & Related papers (2025-03-07T17:46:13Z)
- LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression [7.67622140575795]
We present LVLM-Compress-Bench, a framework to study the broad impact of compression on the generative performance of LVLMs with multi-modal input-driven tasks.
We use four LVLM variants of the popular LLaVA framework to present our analysis via integrating various state-of-the-art KV and weight compression methods.
Our framework demonstrates the compression impact on both general and critical metrics, leveraging a combination of real-world and synthetic datasets.
arXiv Detail & Related papers (2025-03-06T21:21:18Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
Experiments demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
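The sliding-window constraint behind SWAT is standard; here is a minimal sketch of the attention mask it implies (generic, not the paper's training recipe).

```python
# Sliding-window causal mask: each query attends only to the previous
# `window` positions, so per-step memory is O(window) instead of O(seq).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # causal AND within window

mask = sliding_window_mask(seq_len=8, window=3)
# Row i is True exactly at columns i-2..i (clipped at the left edge).
```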
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [35.97391418064724]
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs).
KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention.
Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods.
arXiv Detail & Related papers (2025-02-21T23:34:29Z)
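KVLink's first technique, re-aligning cached positional information after concatenation, can be illustrated under the assumption of rotary embeddings (RoPE): keys cached at local positions are rotated forward by a global offset. The interleaved channel pairing below is one convention; real implementations (e.g., Llama) pair channels differently.

```python
# Sketch of the position-adjustment idea (assuming RoPE): re-rotate
# cached keys by `delta` positions so independently encoded chunks
# concatenate with consistent global positions.
import torch

def rope_shift(k: torch.Tensor, delta: int, base: float = 10000.0):
    """k: [seq, dim] cached keys; returns keys shifted by delta positions."""
    seq, dim = k.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    ang = delta * inv_freq                # extra angle per frequency pair
    cos, sin = ang.cos(), ang.sin()
    k1, k2 = k[..., 0::2], k[..., 1::2]   # interleaved channel pairs
    out = torch.empty_like(k)
    out[..., 0::2] = k1 * cos - k2 * sin  # standard 2D rotation
    out[..., 1::2] = k1 * sin + k2 * cos
    return out
```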
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving.
Recent methods leverage the range-view representation to improve processing efficiency.
We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs).
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
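The acceptance rule described in the RSD entry above can be caricatured in a few lines. Every interface here is assumed, and the real method's controlled bias is more nuanced than a hard threshold.

```python
# Hedged sketch of reward-guided acceptance: keep a cheap draft token
# when a reward model rates the partial output highly; otherwise fall
# back to the expensive target model. All objects are hypothetical.
def rsd_step(draft_model, target_model, reward_model, prefix, tau=0.5):
    cand = draft_model.sample_next(prefix)           # cheap proposal
    if reward_model.score(prefix + [cand]) >= tau:   # high reward: keep
        return cand
    return target_model.sample_next(prefix)          # otherwise pay full cost
```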
- Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets.
This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate the communication and computation burdens of federated fine-tuning.
arXiv Detail & Related papers (2025-01-08T11:37:06Z)
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective.
We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids.
Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling perform robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
- Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in Production [6.459396785817196]
This paper presents two novel approaches to enhance scalability and reduce latency in production dialogue systems.
First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues.
Second, we propose C-LARA, a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues.
arXiv Detail & Related papers (2024-11-19T07:48:35Z)
- Remote Timing Attacks on Efficient Language Model Inference [63.79839291641793]
We show that timing differences in efficient language model inference can be exploited to mount a timing attack.
We show how it is possible to learn the topic of a user's conversation with 90%+ precision.
An adversary can leverage a boosting attack to recover PII placed in messages in open source systems.
arXiv Detail & Related papers (2024-10-22T16:51:36Z)
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [19.510078997414606]
EPIC introduces position-independent context caching for large language models.
EPIC delivers up to 8x improvement in time-to-first-token (TTFT) and 7x higher throughput over existing systems.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
- A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference [41.149350870029046]
Key-value (KV) caching has been found effective for efficient inference of large language models (LLMs).
We propose a unified framework that covers several recent methods and their novel variants.
We find that when reducing the size of the KV cache by 2×, most configurations can achieve higher throughput than standard transformers.
arXiv Detail & Related papers (2024-10-18T13:01:14Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
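The low-rank approximation at the heart of LoRC can be sketched with a truncated SVD of a KV projection weight; the rank choice and the paper's progressive, layer-sensitive schedule are omitted here.

```python
# Sketch of low-rank KV weight compression: factor W [d_out, d_in]
# into two thin matrices so downstream computation runs in rank-r form.
import torch

def low_rank_factor(W: torch.Tensor, r: int):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # [d_out, r]
    B = Vh[:r]                    # [r, d_in]
    return A, B                   # W ≈ A @ B

W = torch.randn(4096, 4096)
A, B = low_rank_factor(W, r=256)
err = (W - A @ B).norm() / W.norm()   # relative approximation error
```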
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
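A rough sketch of ThinK-style query-driven channel pruning follows; the channel-importance score used here is a simplification of the paper's criterion.

```python
# Prune the least significant key channels based on a query-weighted
# magnitude proxy, keeping a fixed number of channels per head.
import torch

def prune_key_channels(K, q, keep: int):
    """K: [seq, dim] cached keys, q: [dim] current query."""
    score = q.abs() * K.abs().sum(0)       # channel importance proxy
    idx = score.topk(keep).indices
    return K[:, idx], idx                  # pruned keys + kept channels

K, q = torch.randn(1024, 128), torch.randn(128)
K_small, kept = prune_key_channels(K, q, keep=64)
# Attention scores then use q[kept] @ K_small.T on half the channels.
```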
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension.
However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
arXiv Detail & Related papers (2024-07-25T12:56:22Z)
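The quadratic-to-linear point in the review above is worth making concrete: without a cache, step t re-encodes all t tokens (quadratic overall); with one, each step only computes K/V for the newest token and appends. A generic single-head sketch:

```python
# Why a KV cache makes per-token generation linear: each decode step
# adds one K row and one V row instead of recomputing all of them.
import torch

d = 64
Wk, Wv, Wq = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_t):                     # x_t: [d] newest token's hidden
    K_cache.append(x_t @ Wk)              # O(1) new projection work
    V_cache.append(x_t @ Wv)
    K = torch.stack(K_cache)              # [t, d], reused, not recomputed
    V = torch.stack(V_cache)
    attn = torch.softmax((x_t @ Wq) @ K.T / d**0.5, -1)
    return attn @ V                       # [d] attention output
```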
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
However, serving inference for long content generation poses a challenge due to the enormous memory footprint of the transient state.
InfiniGen is a novel KV cache management framework tailored for long-text generation.
arXiv Detail & Related papers (2024-06-28T07:41:26Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
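The SPARSEK entry above describes a scoring network plus a differentiable top-k operator; the hard top-k sketch below conveys only the selection idea (the differentiable relaxation is the paper's contribution and is not reproduced here).

```python
# Select a constant number k of KV pairs per query via a learned
# scoring function; hard top-k stands in for the differentiable mask.
import torch

def sparse_topk_attention(q, K, V, score_w, k=16):
    """q: [d], K/V: [seq, d], score_w: [d] simple scoring weights."""
    scores = K @ score_w                     # per-KV-pair importance
    idx = scores.topk(min(k, K.size(0))).indices
    Ks, Vs = K[idx], V[idx]                  # constant-size working set
    attn = torch.softmax(q @ Ks.T / K.size(1) ** 0.5, -1)
    return attn @ Vs
```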
- Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent).
It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.
The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
- Agent-driven Generative Semantic Communication with Cross-Modality and Prediction [57.335922373309074]
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning.
In this work, we develop an agent-assisted semantic encoder with cross-modality capability, which can track semantic changes and channel conditions to perform adaptive semantic extraction and sampling.
The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z)
- Context-aware Communication for Multi-agent Reinforcement Learning [6.109127175562235]
We develop a context-aware communication scheme, CACOM, for multi-agent reinforcement learning (MARL).
In the first stage, agents exchange coarse representations in a broadcast fashion, providing context for the second stage.
Following this, agents utilize attention mechanisms in the second stage to selectively generate messages personalized for the receivers.
To evaluate the effectiveness of CACOM, we integrate it with both actor-critic and value-based MARL algorithms.
arXiv Detail & Related papers (2023-12-25T03:33:08Z)
- TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with Decentralized Natural Language Understanding Models [6.470108226184637]
Multi-agent systems complicate the natural language understanding of user intents.
We propose an efficient parsing and orchestration pipeline algorithm to service multi-intent utterances from the user.
arXiv Detail & Related papers (2023-12-19T03:39:23Z)
- Communication-Efficient Federated Optimization over Semi-Decentralized Networks [42.11743453542266]
Communication efficiency is one of the most challenging bottlenecks in large-scale networks.
We study the communication efficiency under semi-decentralized communication protocol, in which agents can perform both agent-to-agent and agent-to-server communication.
arXiv Detail & Related papers (2023-11-30T18:37:15Z)
- OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking [16.057622631156164]
Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive.
Previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts.
This work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance.
arXiv Detail & Related papers (2023-11-16T10:30:55Z)
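SLM/LLM routing of the kind OrchestraLLM proposes reduces to a dispatch decision per turn. The sketch below uses a simple confidence threshold as the routing signal; the paper's actual router and its interfaces are different and are only stood in for here.

```python
# Hypothetical SLM/LLM router: serve cheap predictions when the small
# model is confident, escalate to the large model otherwise.
def route(small_model, large_model, dialogue_state, tau=0.8):
    answer, confidence = small_model.predict(dialogue_state)
    if confidence >= tau:        # cheap path is good enough
        return answer
    return large_model.predict(dialogue_state)[0]   # escalate
```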
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de → en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure [77.30953347462452]
This paper proposes a novel dialogue generation model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way.
Thanks to the transferable latent structure, our model is able to yield better dialogue responses than four strong baselines in terms of both automatic and human evaluations.
arXiv Detail & Related papers (2022-10-22T14:46:43Z)
- Distributionally Robust Recurrent Decoders with Random Network Distillation [93.10261573696788]
We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to disregard OOD context during inference.
We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.
arXiv Detail & Related papers (2021-10-25T19:26:29Z)
- Accelerating Federated Edge Learning via Optimized Probabilistic Device Scheduling [57.271494741212166]
This paper formulates and solves the communication time minimization problem.
It is found that the optimized policy gradually turns its priority from suppressing the remaining communication rounds to reducing per-round latency as the training process evolves.
The effectiveness of the proposed scheme is demonstrated via a use case on collaborative 3D object detection in autonomous driving.
arXiv Detail & Related papers (2021-07-24T11:39:17Z)
- Minimizing Communication while Maximizing Performance in Multi-Agent Reinforcement Learning [5.612141846711729]
Inter-agent communication can significantly increase performance in multi-agent tasks that require co-ordination.
In real-world applications, where communication may be limited by system constraints like bandwidth, power and network capacity, one might need to reduce the number of messages that are sent.
We show that we can reduce communication by 75% with no loss of performance.
arXiv Detail & Related papers (2021-06-15T23:13:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.