DroidSpeak: Enhancing Cross-LLM Communication
- URL: http://arxiv.org/abs/2411.02820v1
- Date: Tue, 05 Nov 2024 05:41:41 GMT
- Title: DroidSpeak: Enhancing Cross-LLM Communication
- Authors: Yuhan Liu, Esha Choukse, Shan Lu, Junchen Jiang, Madan Musuvathi
- Abstract summary: We introduce DroidSpeak, a novel framework targeting cross-LLM communication.
We efficiently bypass the need to reprocess entire contexts for fine-tuned versions of the same foundational model.
Our findings underscore the potential to create more efficient and scalable multi-agent systems.
- Score: 15.901409892276288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multi-agent systems utilizing Large Language Models (LLMs), communication between agents traditionally relies on natural language. This communication often includes the full context of the query so far, which can introduce significant prefill-phase latency, especially with long contexts. We introduce DroidSpeak, a novel framework to target this cross-LLM communication by leveraging the reuse of intermediate data, such as input embeddings (E-cache) and key-value caches (KV-cache). We efficiently bypass the need to reprocess entire contexts for fine-tuned versions of the same foundational model. This approach allows faster context integration while maintaining the quality of task performance. Experimental evaluations demonstrate DroidSpeak's ability to significantly accelerate inter-agent communication, achieving up to a 2.78x speedup in prefill latency with negligible loss in accuracy. Our findings underscore the potential to create more efficient and scalable multi-agent systems.
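The mechanism is easy to picture in code. Below is a minimal sketch (not the authors' implementation) of the plumbing DroidSpeak optimizes: one agent pays the prefill cost for a shared context once and exports its KV cache, and a second agent derived from the same base model resumes from that cache instead of re-encoding the context. The model names are hypothetical placeholders, and whether a fine-tuned receiver can consume the sender's cache without quality loss is precisely what the paper evaluates.

```python
# Minimal sketch (not the authors' code) of KV-cache handoff between
# two agents derived from the same base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "org/base-7b"              # hypothetical foundation checkpoint
FINETUNE = "org/base-7b-agent-b"  # hypothetical fine-tune of BASE

tok = AutoTokenizer.from_pretrained(BASE)
sender = AutoModelForCausalLM.from_pretrained(BASE)
receiver = AutoModelForCausalLM.from_pretrained(FINETUNE)

context = "Long shared conversation history ..."
ctx_ids = tok(context, return_tensors="pt").input_ids

# Sender pays the prefill cost once and exports its KV cache.
with torch.no_grad():
    kv = sender(ctx_ids, use_cache=True).past_key_values

# Receiver resumes from the cache instead of re-encoding the context.
# Whether this preserves task quality across fine-tunes is what
# DroidSpeak studies; the call itself only shows the plumbing.
q_ids = tok(" What should we do next?", return_tensors="pt").input_ids
with torch.no_grad():
    out = receiver(q_ids, past_key_values=kv, use_cache=True)
print(tok.decode(out.logits[:, -1].argmax(-1)))
```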
Related papers
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.
We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.
We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
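The confidence-aware parallel decoding idea in the Fast-dLLM entry above can be sketched as follows. The model interface is assumed (a masked diffusion LM returning per-position logits), and the fallback rule is a simplification, not the paper's.

```python
# Sketch of confidence-aware parallel decoding for a masked diffusion
# LM (assumed interface: model(ids) -> logits of shape [B, seq, vocab]).
import torch

def parallel_decode_step(model, ids, mask_id, tau=0.9):
    """Unmask every masked position whose top-1 confidence exceeds tau."""
    probs = model(ids).softmax(-1)
    conf, pred = probs.max(-1)            # top-1 confidence and token
    masked = ids.eq(mask_id)
    accept = masked & (conf > tau)
    if not accept.any():
        # Always make progress: fall back to the single most
        # confident masked position.
        flat = torch.where(masked, conf, torch.full_like(conf, -1.0))
        accept = torch.zeros_like(masked).view(-1)
        accept[flat.view(-1).argmax()] = True
        accept = accept.view_as(masked)
    return torch.where(accept, pred, ids)
```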
- dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.
We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.
Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.
This paper formulates LLM inference optimization as a multi-stage online scheduling problem.
We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- Learning LLM Preference over Intra-Dialogue Pairs: A Framework for Utterance-level Understandings [9.763273544617176]
Large language models (LLMs) have demonstrated remarkable capabilities in handling complex dialogue tasks without requiring use case-specific fine-tuning.
In this paper, we introduce a simple yet effective framework to address this challenge.
Our approach is specifically designed for per-utterance classification problems, which encompass tasks such as intent detection, dialogue state tracking, and more.
arXiv Detail & Related papers (2025-03-07T17:46:13Z)
- LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression [7.67622140575795]
We present LVLM-Compress-Bench, a framework to study the broad impact of compression on the generative performance of LVLMs with multi-modal input-driven tasks.
We use four LVLM variants of the popular LLaVA framework to present our analysis via integrating various state-of-the-art KV and weight compression methods.
Our framework demonstrates the compression impact on both general and critical metrics, leveraging a combination of real-world and synthetic datasets.
arXiv Detail & Related papers (2025-03-06T21:21:18Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
Experiments demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
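The sliding-window constraint behind SWAT is standard; here is a minimal sketch of the attention mask it implies (generic, not the paper's training recipe).

```python
# Sliding-window causal mask: each query attends only to the previous
# `window` positions, so per-step memory is O(window) instead of O(seq).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # causal AND within window

mask = sliding_window_mask(seq_len=8, window=3)
# Row i is True exactly at columns i-2..i (clipped at the left edge).
```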
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [35.97391418064724]
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs).
KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention.
Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods.
arXiv Detail & Related papers (2025-02-21T23:34:29Z)
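KVLink's first technique, re-aligning cached positional information after concatenation, can be illustrated under the assumption of rotary embeddings (RoPE): keys cached at local positions are rotated forward by a global offset. The interleaved channel pairing below is one convention; real implementations (e.g., Llama) pair channels differently.

```python
# Sketch of the position-adjustment idea (assuming RoPE): re-rotate
# cached keys by `delta` positions so independently encoded chunks
# concatenate with consistent global positions.
import torch

def rope_shift(k: torch.Tensor, delta: int, base: float = 10000.0):
    """k: [seq, dim] cached keys; returns keys shifted by delta positions."""
    seq, dim = k.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    ang = delta * inv_freq                # extra angle per frequency pair
    cos, sin = ang.cos(), ang.sin()
    k1, k2 = k[..., 0::2], k[..., 1::2]   # interleaved channel pairs
    out = torch.empty_like(k)
    out[..., 0::2] = k1 * cos - k2 * sin  # standard 2D rotation
    out[..., 1::2] = k1 * sin + k2 * cos
    return out
```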
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving.
Recent methods leverage the range-view representation to improve processing efficiency.
We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs).
RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.
RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
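The acceptance rule described in the RSD entry above can be caricatured in a few lines. Every interface here is assumed, and the real method's controlled bias is more nuanced than a hard threshold.

```python
# Hedged sketch of reward-guided acceptance: keep a cheap draft token
# when a reward model rates the partial output highly; otherwise fall
# back to the expensive target model. All objects are hypothetical.
def rsd_step(draft_model, target_model, reward_model, prefix, tau=0.5):
    cand = draft_model.sample_next(prefix)           # cheap proposal
    if reward_model.score(prefix + [cand]) >= tau:   # high reward: keep
        return cand
    return target_model.sample_next(prefix)          # otherwise pay full cost
```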
- Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets.
This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate the communication and computation burdens of federated fine-tuning.
arXiv Detail & Related papers (2025-01-08T11:37:06Z)
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective.
We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids.
Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling perform robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
- Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in Production [6.459396785817196]
This paper presents two novel approaches to enhance scalability and reduce latency in production dialogue systems.
First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues.
Second, we propose C-LARA, a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues.
arXiv Detail & Related papers (2024-11-19T07:48:35Z)
- Remote Timing Attacks on Efficient Language Model Inference [63.79839291641793]
We show that timing differences in efficient language model inference can be exploited to mount a timing attack.
We show how it is possible to learn the topic of a user's conversation with 90%+ precision.
An adversary can leverage a boosting attack to recover PII placed in messages in open source systems.
arXiv Detail & Related papers (2024-10-22T16:51:36Z)
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [19.510078997414606]
EPIC introduces position-independent context caching for large language models.
EPIC delivers up to 8x improvement in time-to-first-token (TTFT) and 7x higher throughput over existing systems.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
- A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference [41.149350870029046]
Key-value (KV) caching has been found effective for efficient inference of large language models (LLMs).
We propose a unified framework that covers several recent methods and their novel variants.
We find that when reducing the size of the KV cache by 2×, most configurations can achieve higher throughput than standard transformers.
arXiv Detail & Related papers (2024-10-18T13:01:14Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
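The low-rank approximation at the heart of LoRC can be sketched with a truncated SVD of a KV projection weight; the rank choice and the paper's progressive, layer-sensitive schedule are omitted here.

```python
# Sketch of low-rank KV weight compression: factor W [d_out, d_in]
# into two thin matrices so downstream computation runs in rank-r form.
import torch

def low_rank_factor(W: torch.Tensor, r: int):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # [d_out, r]
    B = Vh[:r]                    # [r, d_in]
    return A, B                   # W ≈ A @ B

W = torch.randn(4096, 4096)
A, B = low_rank_factor(W, r=256)
err = (W - A @ B).norm() / W.norm()   # relative approximation error
```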
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
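A rough sketch of ThinK-style query-driven channel pruning follows; the channel-importance score used here is a simplification of the paper's criterion.

```python
# Prune the least significant key channels based on a query-weighted
# magnitude proxy, keeping a fixed number of channels per head.
import torch

def prune_key_channels(K, q, keep: int):
    """K: [seq, dim] cached keys, q: [dim] current query."""
    score = q.abs() * K.abs().sum(0)       # channel importance proxy
    idx = score.topk(keep).indices
    return K[:, idx], idx                  # pruned keys + kept channels

K, q = torch.randn(1024, 128), torch.randn(128)
K_small, kept = prune_key_channels(K, q, keep=64)
# Attention scores then use q[kept] @ K_small.T on half the channels.
```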
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension.
However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
arXiv Detail & Related papers (2024-07-25T12:56:22Z)
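The quadratic-to-linear point in the review above is worth making concrete: without a cache, step t re-encodes all t tokens (quadratic overall); with one, each step only computes K/V for the newest token and appends. A generic single-head sketch:

```python
# Why a KV cache makes per-token generation linear: each decode step
# adds one K row and one V row instead of recomputing all of them.
import torch

d = 64
Wk, Wv, Wq = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_t):                     # x_t: [d] newest token's hidden
    K_cache.append(x_t @ Wk)              # O(1) new projection work
    V_cache.append(x_t @ Wv)
    K = torch.stack(K_cache)              # [t, d], reused, not recomputed
    V = torch.stack(V_cache)
    attn = torch.softmax((x_t @ Wq) @ K.T / d**0.5, -1)
    return attn @ V                       # [d] attention output
```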
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management [0.5899781520375794]
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
However, serving inference for long content generation poses a challenge due to the enormous memory footprint of the transient state.
InfiniGen is a novel KV cache management framework tailored for long-text generation.
arXiv Detail & Related papers (2024-06-28T07:41:26Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
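The SPARSEK entry above describes a scoring network plus a differentiable top-k operator; the hard top-k sketch below conveys only the selection idea (the differentiable relaxation is the paper's contribution and is not reproduced here).

```python
# Select a constant number k of KV pairs per query via a learned
# scoring function; hard top-k stands in for the differentiable mask.
import torch

def sparse_topk_attention(q, K, V, score_w, k=16):
    """q: [d], K/V: [seq, d], score_w: [d] simple scoring weights."""
    scores = K @ score_w                     # per-KV-pair importance
    idx = scores.topk(min(k, K.size(0))).indices
    Ks, Vs = K[idx], V[idx]                  # constant-size working set
    attn = torch.softmax(q @ Ks.T / K.size(1) ** 0.5, -1)
    return attn @ Vs
```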
- Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent).
It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation.
The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z)
- Agent-driven Generative Semantic Communication with Cross-Modality and Prediction [57.335922373309074]
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning.
In this work, we develop an agent-assisted semantic encoder with cross-modality capability, which can track semantic changes and channel conditions to perform adaptive semantic extraction and sampling.
The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z)
- Context-aware Communication for Multi-agent Reinforcement Learning [6.109127175562235]
We develop a context-aware communication scheme, CACOM, for multi-agent reinforcement learning (MARL).
In the first stage, agents exchange coarse representations in a broadcast fashion, providing context for the second stage.
Following this, agents utilize attention mechanisms in the second stage to selectively generate messages personalized for the receivers.
To evaluate the effectiveness of CACOM, we integrate it with both actor-critic and value-based MARL algorithms.
arXiv Detail & Related papers (2023-12-25T03:33:08Z)
- TESS: A Multi-intent Parser for Conversational Multi-Agent Systems with Decentralized Natural Language Understanding Models [6.470108226184637]
Multi-agent systems complicate the natural language understanding of user intents.
We propose an efficient parsing and orchestration pipeline algorithm to service multi-intent utterances from the user.
arXiv Detail & Related papers (2023-12-19T03:39:23Z)
- Communication-Efficient Federated Optimization over Semi-Decentralized Networks [42.11743453542266]
Communication efficiency is one of the most challenging bottlenecks in large-scale networks.
We study the communication efficiency under semi-decentralized communication protocol, in which agents can perform both agent-to-agent and agent-to-server communication.
arXiv Detail & Related papers (2023-11-30T18:37:15Z)
- OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking [16.057622631156164]
Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive.
Previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts.
This work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance.
arXiv Detail & Related papers (2023-11-16T10:30:55Z)
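SLM/LLM routing of the kind OrchestraLLM proposes reduces to a dispatch decision per turn. The sketch below uses a simple confidence threshold as the routing signal; the paper's actual router and its interfaces are different and are only stood in for here.

```python
# Hypothetical SLM/LLM router: serve cheap predictions when the small
# model is confident, escalate to the large model otherwise.
def route(small_model, large_model, dialogue_state, tau=0.8):
    answer, confidence = small_model.predict(dialogue_state)
    if confidence >= tau:        # cheap path is good enough
        return answer
    return large_model.predict(dialogue_state)[0]   # escalate
```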
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de → en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure [77.30953347462452]
This paper proposes a novel dialogue generation model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way.
Thanks to the transferable latent structure, our model is able to yield better dialogue responses than four strong baselines in terms of both automatic and human evaluations.
arXiv Detail & Related papers (2022-10-22T14:46:43Z)
- Distributionally Robust Recurrent Decoders with Random Network Distillation [93.10261573696788]
We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to disregard OOD context during inference.
We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.
arXiv Detail & Related papers (2021-10-25T19:26:29Z)
- Accelerating Federated Edge Learning via Optimized Probabilistic Device Scheduling [57.271494741212166]
This paper formulates and solves the communication time minimization problem.
It is found that the optimized policy gradually turns its priority from suppressing the remaining communication rounds to reducing per-round latency as the training process evolves.
The effectiveness of the proposed scheme is demonstrated via a use case on collaborative 3D object detection in autonomous driving.
arXiv Detail & Related papers (2021-07-24T11:39:17Z)
- Minimizing Communication while Maximizing Performance in Multi-Agent Reinforcement Learning [5.612141846711729]
Inter-agent communication can significantly increase performance in multi-agent tasks that require co-ordination.
In real-world applications, where communication may be limited by system constraints like bandwidth, power and network capacity, one might need to reduce the number of messages that are sent.
We show that we can reduce communication by 75% with no loss of performance.
arXiv Detail & Related papers (2021-06-15T23:13:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.