Efficient Beam Search for Large Language Models Using Trie-Based Decoding
- URL: http://arxiv.org/abs/2502.00085v1
- Date: Fri, 31 Jan 2025 16:22:36 GMT
- Title: Efficient Beam Search for Large Language Models Using Trie-Based Decoding
- Authors: Brian J Chan, Jui-Hung Cheng, Mao Xun Huang, Chao-Ting Chen, Hen-Hsen Huang,
- Abstract summary: We introduce a novel parallel decoding method that addresses the memory inefficiency of batch-based beam search.<n>By sharing a single cache among all beams that share the same prefix, the proposed method not only reduces memory consumption dramatically but also enables parallel decoding across all branches.<n>This innovative use of a prefix tree offers an efficient alternative for beam search, achieving significant memory savings while preserving inference speed, making it particularly well-suited for memory-constrained environments or large-scale model deployments.
- Score: 10.302821791274129
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In Transformer-based sequence-to-sequence generation, beam search has proven effective in enhancing the quality of generated sequences compared to greedy decoding. Conventional beam search methods typically adopt either a sequential or batch-based approach. The sequential approach, while memory-efficient, requires multiple decoding passes to construct a complete search tree, leading to significantly slower inference. On the other hand, the batch-based approach enables parallel computation across beams, but at the expense of high memory consumption due to the need to maintain separate key-value (KV) caches for each beam. In this study, we introduce a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache among all beams that share the same prefix, the proposed method not only reduces memory consumption dramatically but also enables parallel decoding across all branches. This innovative use of a prefix tree offers an efficient alternative for beam search, achieving significant memory savings while preserving inference speed, making it particularly well-suited for memory-constrained environments or large-scale model deployments.
Related papers
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.<n>We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.<n>Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference [35.730941605490194]
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks.
Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens.
This paper explores the novel integration of speculative decoding with beam sampling.
arXiv Detail & Related papers (2024-09-25T02:20:42Z) - An Efficient Procedure for Computing Bayesian Network Structure Learning [0.9208007322096532]
We propose a globally optimal Bayesian network structure discovery algorithm based on a progressively leveled scoring approach.
Experimental results indicate that our method, when using only memory, not only reduces peak memory usage but also improves computational efficiency.
arXiv Detail & Related papers (2024-07-24T07:59:18Z) - HashReID: Dynamic Network with Binary Codes for Efficient Person
Re-identification [3.3372444460738357]
Biometric applications, such as person re-identification (ReID), are often deployed on energy constrained devices.
While recent ReID methods prioritize high retrieval performance, they often come with large computational costs and high search time.
We propose an input-adaptive network with multiple exit blocks, that can terminate early if the retrieval is straightforward or noisy.
arXiv Detail & Related papers (2023-08-23T04:01:54Z) - Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix
Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method to solve large-scale multimedia retrieval issue.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Recall@k Surrogate Loss with Large Batches and Similarity Mixup [62.67458021725227]
Direct optimization, by gradient descent, of an evaluation metric is not possible when it is non-differentiable.
In this work, a differentiable surrogate loss for the recall is proposed.
The proposed method achieves state-of-the-art results in several image retrieval benchmarks.
arXiv Detail & Related papers (2021-08-25T11:09:11Z) - A Genetic Algorithm for Obtaining Memory Constrained Near-Perfect
Hashing [0.0]
We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size.
The paper results show that near-perfect hashing is faster than binary search, yet uses less memory than perfect hashing.
arXiv Detail & Related papers (2020-07-16T12:57:15Z) - Best-First Beam Search [78.71330480725668]
We show that the standard implementation of beam search can be made up to 10x faster in practice.
We propose a memory-reduced variant of Best-First Beam Search, which has a similar beneficial search bias in terms of downstream performance.
arXiv Detail & Related papers (2020-07-08T05:56:01Z) - Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods have poor performance in retrieval using an extremely short-length hash code.
In this study, we propose a novel reinforcing short-length hashing (RSLH)
In this proposed RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
arXiv Detail & Related papers (2020-04-24T02:23:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.