Efficient Long-range Language Modeling with Self-supervised Causal Retrieval
- URL: http://arxiv.org/abs/2410.01651v1
- Date: Wed, 2 Oct 2024 15:18:34 GMT
- Title: Efficient Long-range Language Modeling with Self-supervised Causal Retrieval
- Authors: Xiang Hu, Zhihao Teng, Wei Wu, Kewei Tu
- Abstract summary: Grouped Cross-Attention is a novel module enabling joint pre-training of the retriever and causal LM.
By integrating top-$k$ retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens.
- Score: 39.24972628990943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module enabling joint pre-training of the retriever and causal LM, and apply it to long-context modeling. For a given input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. Our innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. By integrating top-$k$ retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens. Our experiments show our model, compared with long-range LM baselines, can achieve lower perplexity with comparable or lower pre-training and inference costs.
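The chunk-and-retrieve step described in the abstract can be illustrated with a minimal sketch. It is not the authors' Grouped Cross-Attention module: the function names, the dot-product relevance score, and the softmax weighting over the selected chunks are all assumptions made for illustration, and the chunk embeddings are supplied by hand rather than learned.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token sequence into consecutive chunks (the last may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def dot(a, b):
    """Dot product of two equal-length vectors; the relevance score assumed here."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_top_k(chunk_embeddings, current, k):
    """Return (past_chunk_index, weight) pairs for the chunk at index `current`.

    Only strictly earlier chunks are candidates, so retrieval stays causal.
    The softmax weights are an assumed stand-in for a differentiable signal
    that would let the language-modeling loss reach the retriever.
    """
    query = chunk_embeddings[current]
    scored = sorted(
        ((dot(query, chunk_embeddings[j]), j) for j in range(current)),
        reverse=True,
    )[:k]
    if not scored:
        return []  # the first chunk has no past to retrieve from
    weights = softmax([score for score, _ in scored])
    return [(j, w) for (_, j), w in zip(scored, weights)]


if __name__ == "__main__":
    chunks = split_into_chunks(list(range(10)), chunk_size=4)
    print(len(chunks), "chunks")
    # Hypothetical 2-d chunk embeddings, one per chunk position.
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    print(causal_top_k(embeddings, current=3, k=2))
```

Because each chunk attends to at most k retrieved chunks rather than to the whole history, the per-chunk cost in this sketch depends on k and the chunk size instead of the full context length, which is the intuition behind pre-training with long contexts at modest cost.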
Related papers
- Learning to Retrieve Iteratively for In-Context Learning [56.40100968649039]
Iterative retrieval is a novel framework that empowers retrievers to make iterative decisions through policy optimization.
We instantiate an iterative retriever for composing in-context learning exemplars and apply it to various semantic parsing tasks.
By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever.
arXiv Detail & Related papers (2024-06-20T21:07:55Z)
- Simple and Scalable Strategies to Continually Pre-train Large Language Models [20.643648785602462]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We show that a simple and scalable combination of learning rate re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch.
arXiv Detail & Related papers (2024-03-13T17:58:57Z)
- Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval [50.47192086219752]
ABEL is a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings.
By either fine-tuning ABEL on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.
arXiv Detail & Related papers (2023-11-27T06:22:57Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step.
GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever at the recall benchmark in a zero-shot setting.
arXiv Detail & Related papers (2023-06-14T06:00:18Z)
- Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models [6.425088990363101]
We examine the relationship between fluency and attribution in Large Language Models prompted with retrieved evidence.
We show that larger models tend to do much better in both fluency and attribution.
We propose a recipe that could allow smaller models to both close the gap with larger models and preserve the benefits of top-k retrieval.
arXiv Detail & Related papers (2023-02-11T02:43:34Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)