Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
- URL: http://arxiv.org/abs/2310.09949v4
- Date: Mon, 24 Mar 2025 18:01:48 GMT
- Title: Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
- Authors: Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, Gustavo Alonso
- Abstract summary: A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge. We propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators.
- Score: 20.286113681831814
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to a 2.16x reduction in latency and a 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.
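The disaggregated design can be pictured as a thin CPU-side coordinator that routes retrieval requests to a pool of vector-search workers (FPGAs in the prototype) and generation requests to a pool of LLM workers (GPUs), with each pool scaled independently. The sketch below is a minimal illustration of that control flow; the class names, stub logic, and load-balancing policy are assumptions for exposition, not Chameleon's actual implementation.

```python
import random

# Minimal sketch of a disaggregated RALM serving loop (illustrative only).
# VectorSearchWorker stands in for an FPGA-based ANN accelerator and
# LLMWorker for a GPU inference engine; both are hypothetical stubs.

class VectorSearchWorker:
    def retrieve(self, query_vec, k=3):
        # Placeholder: a real worker would run ANN search over a sharded index.
        return [f"doc_{random.randrange(1000)}" for _ in range(k)]

class LLMWorker:
    def generate(self, prompt, context_docs, n_tokens=8):
        # Placeholder: a real worker would run batched LLM decoding on a GPU.
        return prompt + " " + " ".join(f"tok{i}" for i in range(n_tokens))

class Coordinator:
    """CPU-side coordinator; retrieval and generation pools scale independently."""
    def __init__(self, n_retrievers=4, n_generators=2):
        self.retrievers = [VectorSearchWorker() for _ in range(n_retrievers)]
        self.generators = [LLMWorker() for _ in range(n_generators)]

    def serve(self, prompt, query_vec):
        retriever = random.choice(self.retrievers)   # load-balance retrieval
        generator = random.choice(self.generators)   # load-balance inference
        docs = retriever.retrieve(query_vec)
        return generator.generate(prompt, docs)

if __name__ == "__main__":
    coord = Coordinator()
    print(coord.serve("What is RALM?", query_vec=[0.1, 0.2, 0.3]))
```

Because the retrieval and generation pools are separate objects, either can be resized without touching the other, which is the point of disaggregation.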
Related papers
- An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline [0.6445605125467574]
Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases.
Existing optimizations for vector search and LLM serving have largely been developed in isolation.
This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems.
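The summary leaves the partitioning rule unspecified; one plausible reading of adaptive index partitioning, shown below purely as a hypothetical sketch, is to place the most frequently hit IVF clusters on the GPU until a memory budget is exhausted and leave the rest on the CPU.

```python
# Hypothetical sketch of adaptive IVF-cluster partitioning: hot clusters go
# to the GPU until its memory budget is exhausted; the rest stay on the CPU.
# Cluster sizes, hit counts, and the greedy rule are illustrative assumptions.

def partition_clusters(cluster_sizes_mb, hit_counts, gpu_budget_mb):
    order = sorted(range(len(cluster_sizes_mb)),
                   key=lambda c: hit_counts[c], reverse=True)  # hottest first
    gpu, cpu, used = [], [], 0.0
    for c in order:
        if used + cluster_sizes_mb[c] <= gpu_budget_mb:
            gpu.append(c)
            used += cluster_sizes_mb[c]
        else:
            cpu.append(c)
    return gpu, cpu

# The hottest clusters that fit within the budget land on the GPU.
gpu, cpu = partition_clusters([40, 25, 60, 10], [900, 50, 700, 300],
                              gpu_budget_mb=80)
print("GPU clusters:", gpu, "CPU clusters:", cpu)
```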
arXiv Detail & Related papers (2025-04-11T19:18:41Z)
- HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design [55.54477725000291]
HiVeGen is a hierarchical Verilog generation framework that decomposes generation tasks into hierarchical submodules.
It integrates automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation and introduces weight-based retrieval to enhance code reuse.
It also enables real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.
arXiv Detail & Related papers (2024-12-06T19:37:53Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational costs of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
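Early exit can be sketched as running decoder layers until an intermediate prediction looks confident enough, then skipping the rest. The toy layers, classifier head, and confidence threshold below are illustrative assumptions, not DeeR's actual exit criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for transformer layers and a shared classifier head.
layers = [lambda h, W=rng.normal(size=(8, 8)) / 8: np.tanh(h @ W)
          for _ in range(6)]
head = rng.normal(size=(8, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(h, threshold=0.6):
    """Run layers until the head's max probability exceeds the threshold
    (an assumed criterion), then skip the remaining layers."""
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        probs = softmax(h @ head)
        if probs.max() >= threshold:   # confident enough: exit early
            return probs.argmax(), depth
    return probs.argmax(), depth       # fell through: used the full model

pred, used = early_exit_forward(rng.normal(size=8))
print(f"prediction {pred} using {used}/{len(layers)} layers")
```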
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
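The summary does not detail FusionLLM's compression scheme; top-k gradient sparsification below is a generic example of the kind of adaptive compression used to cope with low-bandwidth links in decentralized training, shown purely for illustration.

```python
import numpy as np

# Generic top-k gradient sparsification (illustrative; not necessarily
# FusionLLM's scheme). Only the k largest-magnitude entries and their
# indices are transmitted over the slow network.

def compress(grad, ratio=0.01):
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of top-k magnitudes
    return idx, grad[idx]

def decompress(idx, vals, size):
    out = np.zeros(size, dtype=vals.dtype)
    out[idx] = vals
    return out

g = np.random.default_rng(1).normal(size=10_000)
idx, vals = compress(g, ratio=0.01)                # send ~1% of the entries
g_hat = decompress(idx, vals, g.size)
print("kept", len(idx), "of", g.size, "entries;",
      "recovered energy:", round(float(np.sum(g_hat**2) / np.sum(g**2)), 3))
```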
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome the sequential decoding cost of such refinement.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently the two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference [2.9302211589186244]
Large language models (LLMs) have transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations.
Developments in computing and memory capabilities are lagging behind, a gap exacerbated by the end of Moore's law.
Compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by performing analog computations directly in memory.
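The core CIM idea follows from Ohm's and Kirchhoff's laws: program weights as conductances G, apply inputs as voltages V, and each column current I = G·V is a dot product computed in a single analog step. A toy numerical illustration of this principle (values arbitrary):

```python
import numpy as np

# Toy illustration of analog matrix-vector multiply in a crossbar array:
# weights are programmed as conductances G (siemens), inputs applied as
# voltages V (volts), and each output current I = G @ V (amperes) is a
# dot product computed "for free" by Kirchhoff's current law.

G = np.array([[1e-6, 2e-6, 0.5e-6],    # conductance matrix (the weights)
              [3e-6, 1e-6, 2e-6]])
V = np.array([0.2, 0.1, 0.4])          # input voltages (the activations)

I = G @ V                              # analog MVM: one step, no data movement
print(I)                               # output currents: [6.0e-07, 1.5e-06]
```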
arXiv Detail & Related papers (2024-06-12T16:57:58Z)
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) into remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
- Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation [15.35494431928751]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving.
We introduce model-attention disaggregation to enhance the efficiency of LLM decoding.
We develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
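The summary does not spell out the split, but model-attention disaggregation generally places bandwidth-bound attention over the KV cache on memory-optimized hardware and compute-bound dense layers on compute-optimized hardware. The stub classes and hand-off below are a speculative schematic of that idea, not Lamina's implementation.

```python
# Schematic of model-attention disaggregation (illustrative stubs): attention
# over the KV cache runs on a memory-bandwidth-optimized node, dense
# FFN/projection math on a compute-optimized node, with activations shipped
# between them each decode step.

class MemoryNode:
    """Holds the KV cache; does bandwidth-bound attention."""
    def __init__(self):
        self.kv_cache = []

    def attention(self, q):
        self.kv_cache.append(q)                  # append current token's KV
        # toy attention: mean over cached entries stands in for softmax(QK^T)V
        return sum(self.kv_cache) / len(self.kv_cache)

class ComputeNode:
    """Runs the compute-bound dense layers."""
    def ffn(self, h):
        return max(0.0, 2.0 * h + 1.0)           # toy ReLU MLP

def decode_step(token_state, mem, comp):
    attn_out = mem.attention(token_state)        # network hop 1: memory node
    return comp.ffn(attn_out)                    # network hop 2: compute node

mem, comp = MemoryNode(), ComputeNode()
state = 0.3
for _ in range(4):                               # a few autoregressive steps
    state = decode_step(state, mem, comp)
print(state)
```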
arXiv Detail & Related papers (2024-05-03T02:15:15Z)
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization [9.517540904818986]
This paper proposes adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters.
Experiments on production inference workloads in 11 different clusters demonstrate that LLM-PQ achieves up to 2.88x (2.26x on average) throughput improvement in inference.
arXiv Detail & Related papers (2024-03-02T08:40:07Z)
- Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters.
We propose Smart Parallel Auto-Correct Decoding (SPACE).
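Parallel auto-correct decoding follows a draft-then-verify pattern: cheaply propose several future tokens, check them against the target model in one (conceptually parallel) pass, and keep the longest verified prefix, correcting the first mismatch. The deterministic toy "models" and drafting rule below are assumptions; SPACE's actual semi-autoregressive scheme differs.

```python
# Toy draft-then-verify decoding loop (generic speculative-style pattern;
# not SPACE's exact scheme). Both "models" are deterministic stubs over
# integer token ids, so the loop is reproducible.

def target_next(ctx):          # the expensive model's greedy next token
    return (sum(ctx) * 7 + 3) % 50

def draft_next(ctx):           # cheap drafter: right most of the time
    t = target_next(ctx)
    return t if sum(ctx) % 4 else (t + 1) % 50   # inject occasional mistakes

def decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        # 1) draft k tokens autoregressively with the cheap rule
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) verify all k in one (conceptually parallel) target-model pass
        accepted = []
        for i, t in enumerate(draft):
            if target_next(out + draft[:i]) == t:
                accepted.append(t)                             # verified token
            else:
                accepted.append(target_next(out + draft[:i]))  # auto-correct
                break                                          # stop at mismatch
        out.extend(accepted)
    return out[:len(prompt) + n_tokens]

print(decode([1, 2, 3], n_tokens=10))
```

Every accepted token equals the target model's greedy choice for its true prefix, so the output matches plain greedy decoding; only the number of expensive sequential steps shrinks.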
arXiv Detail & Related papers (2024-02-19T03:39:10Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that the decoding output of an LLM often contains abundant text spans identical to a reference that is available in many real-world scenarios.
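That observation suggests a simple copy-and-verify loop: when the output's current suffix matches a span in the reference, copy the following reference tokens as a draft and keep only the model-verified prefix. The toy greedy "model" and span lengths below are illustrative assumptions, not LLMA's exact procedure.

```python
# Toy sketch of reference-based lossless acceleration: copy tokens from a
# reference document whenever the output's suffix matches it, then verify
# the copied span against the model. The greedy "model" is a stub; the
# match and copy lengths are illustrative.

def model_next(ctx):                      # stand-in for expensive LLM decoding
    return (len(ctx) * 31 + ctx[-1]) % 20 if ctx else 0

def find_copy(out, reference, match_len=2, copy_len=4):
    """If the last match_len output tokens occur in the reference, return
    the copy_len reference tokens that follow, as a draft to verify."""
    suffix = out[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == suffix:
            return reference[i + match_len:i + match_len + copy_len]
    return []

def decode(prompt, reference, n_tokens):
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        draft = find_copy(out, reference)
        if draft:
            for i, t in enumerate(draft):     # one parallel verify pass
                if model_next(out + draft[:i]) != t:
                    draft = draft[:i]         # keep only the verified prefix
                    break
            out.extend(draft)
        if not draft:                         # no usable span: normal step
            out.append(model_next(out))
    return out[:len(prompt) + n_tokens]

reference = [4, 9, 2, 7, 7, 1, 4, 9, 2, 7]
print(decode([4, 9], reference, n_tokens=8))
```

Because every emitted token is either model-generated or model-verified, the output is identical to plain decoding, which is why the acceleration is lossless.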
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
- LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning [2.0625936401496237]
Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems.
We present a real-time sparse training acceleration system named LearningGroup.
Our system reduces the cycle time and memory footprint of sparse data generation by up to 5.72x and 6.81x, respectively.
arXiv Detail & Related papers (2022-10-29T15:09:34Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
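A butterfly pattern replaces a dense n x n matrix with log2(n) sparse factors, each with two nonzeros per row, reducing a matrix-vector product from O(n^2) to O(n log n) operations. The sketch below builds random butterfly factors in the classic FFT connectivity; the coefficients and sizes are illustrative, not the paper's learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly_matvec(x, coeffs):
    """Multiply x by a product of log2(n) butterfly factors.
    coeffs[s] holds per-pair 2x2 blocks for stage s (FFT connectivity):
    each stage mixes entries `half` apart, using 2 nonzeros per row."""
    y = np.asarray(x, dtype=float).copy()
    n, half = len(y), 1
    for blocks in coeffs:                    # log2(n) sparse stages
        b = 0
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a11, a12, a21, a22 = blocks[b]
                b += 1
                u, v = y[i], y[i + half]
                y[i], y[i + half] = a11 * u + a12 * v, a21 * u + a22 * v
        half *= 2
    return y

n = 8                                        # must be a power of two
coeffs = [rng.normal(size=(n // 2, 4)) for _ in range(int(np.log2(n)))]
x = rng.normal(size=n)

# Sanity check against the equivalent dense matrix (built column by column).
dense = np.column_stack([butterfly_matvec(e, coeffs) for e in np.eye(n)])
print(np.allclose(dense @ x, butterfly_matvec(x, coeffs)))   # True
```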
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness [16.244832640402496]
We propose a novel mapping algorithm with both computation and communication awareness.
By slightly trading computation for communication, the system overall latency and energy consumption can be largely reduced.
The superior performance of our approach is demonstrated through MAESTRO-based modeling.
arXiv Detail & Related papers (2022-04-29T02:26:18Z)
- LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
We show that the latency-perplexity Pareto frontier can be found without any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z)
- Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators [4.055002321981825]
We present a HW-SW co-design ecosystem for spatial accelerators called Union.
Our framework allows exploring different algorithms and their mappings on several accelerator cost models.
We demonstrate the value of Union for the community with several case studies.
arXiv Detail & Related papers (2021-09-15T16:42:18Z)
- GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
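The ghost-feature idea can be sketched in a few lines: compute a subset of "intrinsic" channels normally, then synthesize the remaining channels as spatial shifts of them, which costs no multiplications. The offsets and channel split below are illustrative assumptions, not GhostSR's learned configuration.

```python
import numpy as np

# Toy sketch of ghost features via shift: a few "intrinsic" channels are
# computed normally, and the remaining channels are generated as spatial
# shifts of them, at zero multiplication cost.

def shift(feat, dy, dx):
    """Shift a (H, W) feature map by (dy, dx) with zero padding."""
    h, w = feat.shape
    out = np.zeros_like(feat)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = feat[src_y, src_x]
    return out

intrinsic = np.random.default_rng(0).normal(size=(4, 16, 16))  # 4 real channels
offsets = [(0, 1), (1, 0), (0, -1), (-1, 0)]                   # assumed shifts
ghost = np.stack([shift(c, dy, dx) for c, (dy, dx) in zip(intrinsic, offsets)])
features = np.concatenate([intrinsic, ghost])   # 8 channels, half of them free
print(features.shape)                            # (8, 16, 16)
```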
arXiv Detail & Related papers (2021-01-21T10:09:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.