Why do Nearest Neighbor Language Models Work?
- URL: http://arxiv.org/abs/2301.02828v1
- Date: Sat, 7 Jan 2023 11:12:36 GMT
- Title: Why do Nearest Neighbor Language Models Work?
- Authors: Frank F. Xu, Uri Alon, Graham Neubig
- Abstract summary: Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context.
Retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore.
- Score: 93.71050438413121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) compute the probability of a text by sequentially
computing a representation of an already-seen context and using this
representation to predict the next word. Currently, most LMs calculate these
representations through a neural network consuming the immediate previous
context. However, retrieval-augmented LMs have recently been shown to improve over
standard neural LMs by accessing information retrieved from a large datastore,
in addition to their standard, parametric, next-word prediction. In this paper,
we set out to understand why retrieval-augmented language models, and
specifically why k-nearest neighbor language models (kNN-LMs) perform better
than standard parametric LMs, even when the k-nearest neighbor component
retrieves examples from the same training set that the LM was originally
trained on. To this end, we perform a careful analysis of the various
dimensions over which kNN-LM diverges from standard LMs, and investigate these
dimensions one by one. Empirically, we identify three main reasons why kNN-LM
performs better than standard LMs: using a different input representation for
predicting the next tokens, approximate kNN search, and the importance of
softmax temperature for the kNN distribution. Further, we incorporate these
insights into the model architecture or the training procedure of the standard
parametric LM, improving its results without the need for an explicit retrieval
component. The code is available at https://github.com/frankxu2004/knnlm-why.
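As a rough illustration of the mechanism the abstract analyzes, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of how a kNN-LM forms its next-token distribution: the k nearest datastore entries are converted into a distribution via a temperature-scaled softmax over negative distances and then interpolated with the parametric LM's distribution. The datastore contents, interpolation weight, and temperature here are placeholder values.

```python
# Minimal kNN-LM sketch (illustration only): build a next-token distribution
# from retrieved neighbors and interpolate it with the parametric LM's output.
# The datastore, interpolation weight, and temperature are placeholders.
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest (context key, next-token) pairs into a vocab distribution.

    query:  representation of the current context, shape (d,)
    keys:   stored context representations, shape (N, d)
    values: next-token id paired with each key, shape (N,)
    """
    # Exact L2 search for clarity; deployed kNN-LMs use approximate search.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances; the temperature flattens or sharpens
    # the retrieval distribution (one of the factors the paper analyzes).
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, token_id in zip(weights, values[nearest]):
        p_knn[token_id] += w  # neighbors that share a token pool their mass
    return p_knn

def interpolate(p_lm, p_knn, lam=0.25):
    """kNN-LM next-token distribution: lam * p_kNN + (1 - lam) * p_LM."""
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
vocab_size, d, n_entries = 10, 8, 100
keys = rng.normal(size=(n_entries, d))
values = rng.integers(0, vocab_size, size=n_entries)
p_lm = rng.dirichlet(np.ones(vocab_size))
p_knn = knn_distribution(rng.normal(size=d), keys, values, vocab_size, temperature=2.0)
p_next = interpolate(p_lm, p_knn)
assert abs(p_next.sum() - 1.0) < 1e-9
```

In practice the exact search above is replaced by approximate kNN search over a datastore with millions of entries, which the paper identifies as one of the factors behind the kNN-LM's behavior.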
Related papers
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources.
NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.
In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- KNN-LM Does Not Improve Open-ended Text Generation [34.86733697757264]
We study the generation quality of retrieval-augmented language models (LMs).
We find that interpolating with a retrieval distribution actually increases perplexity compared to a baseline Transformer LM.
We discover that the entropy of the retrieval distribution increases faster than that of the base LM as the generated sequence becomes longer (see the entropy sketch after this list).
arXiv Detail & Related papers (2023-05-24T01:48:33Z)
- Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
- You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
arXiv Detail & Related papers (2022-10-28T02:57:40Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
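To make the entropy finding cited in the "KNN-LM Does Not Improve Open-ended Text Generation" entry concrete, here is a small illustrative sketch (an assumption for illustration, not taken from any of the papers above) of how one might track the entropy of the kNN retrieval distribution against that of the base LM at each generation step; the per-step distributions below are random placeholders.

```python
# Illustrative sketch: track the entropy of the kNN retrieval distribution
# versus the base LM distribution at each generation step.
# p_lm_steps / p_knn_steps are random placeholders of shape (steps, vocab).
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a probability vector or a batch of them."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

rng = np.random.default_rng(0)
steps, vocab = 50, 100
p_lm_steps = rng.dirichlet(np.ones(vocab), size=steps)
p_knn_steps = rng.dirichlet(np.ones(vocab), size=steps)

h_lm, h_knn = entropy(p_lm_steps), entropy(p_knn_steps)
# The cited finding is that H(kNN) grows faster than H(base LM) as the
# generated sequence gets longer; random data shows no such trend, of course.
for t in range(0, steps, 10):
    print(f"step {t:3d}  H(base LM) = {h_lm[t]:.3f}  H(kNN) = {h_knn[t]:.3f}")
```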
This list is automatically generated from the titles and abstracts of the papers in this site.