BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
- URL: http://arxiv.org/abs/2310.01329v2
- Date: Fri, 3 May 2024 05:41:55 GMT
- Title: BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
- Authors: Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi
- Abstract summary: Retrieval augmentation addresses many critical problems in large language models.
Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text.
We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages.
- Score: 77.0501668780182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval augmentation addresses many critical problems in large language models such as hallucination, staleness, and privacy leaks. However, running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text. We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages, significantly reducing computation during inference. Despite the potential loss of accuracy, our new calibration techniques and training objectives restore performance. Combined with offline and runtime compression, this only requires 127GB of disk space for encoding 3 billion tokens in Wikipedia. Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance.
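To make the 1-bit idea concrete, here is a minimal sketch (the 768-dimensional encoder size, NumPy bit-packing, and function names are illustrative assumptions, not the authors' released implementation): passage token vectors are precomputed once, thresholded at zero, stored as packed bits, and expanded to ±1 vectors at inference time instead of re-encoding the passage. The paper's calibration techniques and training objectives, which recover the accuracy lost by thresholding, are not shown here.

```python
import numpy as np

def binarize(token_reps: np.ndarray) -> np.ndarray:
    """Threshold real-valued token representations (num_tokens x hidden_dim)
    at zero, yielding one bit per dimension."""
    return (token_reps >= 0).astype(np.uint8)

def pack(binary_reps: np.ndarray) -> np.ndarray:
    """Pack the 0/1 values into bytes for storage: a 768-dim float32 vector
    (3072 bytes) becomes 96 bytes, roughly a 32x reduction before any further compression."""
    return np.packbits(binary_reps, axis=-1)

def unpack(packed: np.ndarray, hidden_dim: int) -> np.ndarray:
    """Recover {-1, +1} vectors at inference time for the reader to consume."""
    bits = np.unpackbits(packed, axis=-1)[..., :hidden_dim]
    return bits.astype(np.float32) * 2.0 - 1.0

# Hypothetical precomputed encoder outputs for one retrieved passage.
rng = np.random.default_rng(0)
passage_reps = rng.standard_normal((128, 768)).astype(np.float32)  # 128 tokens
stored = pack(binarize(passage_reps))        # what would sit on disk
restored = unpack(stored, hidden_dim=768)    # used instead of re-encoding the passage
```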
Related papers
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs).
We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm.
Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
- Adjoint sharding for very long context training of state space models [7.723642550918118]
Adjoint sharding is a technique that shards the gradient computation during training, reducing memory requirements by orders of magnitude.
We show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B-parameter large language model trained on a 1M-token context length.
This makes it possible to increase the maximum context length during training or fine-tuning of a 1.27B-parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
arXiv Detail & Related papers (2025-01-01T01:10:59Z)
- SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks.
We have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens.
We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z) - MrT5: Dynamic Token Merging for Efficient Byte-level Language Models [50.46453950887946]
This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 can transfer its deletion feature zero-shot to several other languages.
arXiv Detail & Related papers (2024-10-28T06:14:12Z)
- Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing train-free methods by 30% and even a training method by 25%.
arXiv Detail & Related papers (2024-08-16T12:20:56Z)
- VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
- BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation [13.262366437264188]
BitDistiller is a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of Large Language Models (LLMs).
Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective.
Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks.
arXiv Detail & Related papers (2024-02-16T12:27:15Z)
- Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition [20.926163659469587]
We propose a new memory-augmented, lookup-dictionary-based Transformer architecture for language modeling.
The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens.
The proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate.
arXiv Detail & Related papers (2022-12-30T22:26:57Z)
- Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a 2-trillion-token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
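As a rough sketch of the chunk-level retrieval that the RETRO entry above conditions on (illustrative only; the chunk length, embedding function, and dot-product scoring are placeholder assumptions rather than the paper's implementation):

```python
import numpy as np

CHUNK_LEN = 64  # the input is split into fixed-size chunks; each gets its own neighbours

def embed(tokens: list[int]) -> np.ndarray:
    """Placeholder chunk embedder standing in for the frozen encoder RETRO uses."""
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.standard_normal(128).astype(np.float32)

def retrieve_neighbours(input_ids: list[int],
                        db_chunks: list[list[int]],
                        db_embeddings: np.ndarray,
                        k: int = 2) -> list[list[list[int]]]:
    """For every chunk of the input, return the k highest-scoring database chunks;
    the decoder then cross-attends to these neighbours while generating."""
    neighbours = []
    for start in range(0, len(input_ids), CHUNK_LEN):
        query = embed(input_ids[start:start + CHUNK_LEN])
        scores = db_embeddings @ query          # one score per database chunk
        top_k = np.argsort(-scores)[:k]
        neighbours.append([db_chunks[i] for i in top_k])
    return neighbours
```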