Great Memory, Shallow Reasoning: Limits of $k$NN-LMs
- URL: http://arxiv.org/abs/2408.11815v1
- Date: Wed, 21 Aug 2024 17:59:05 GMT
- Title: Great Memory, Shallow Reasoning: Limits of $k$NN-LMs
- Authors: Shangyi Geng, Wenting Zhao, Alexander M. Rush
- Abstract summary: $k$NN-LMs, which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling.
We ask whether this improved ability to recall information really translates into downstream abilities.
- Score: 71.73611113995143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $K$-nearest neighbor language models ($k$NN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a $k$NN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate $k$NN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that $k$NN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, $k$NN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at https://github.com/GSYfate/knnlm-limits/.
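The retrieval mechanism the abstract refers to follows the standard $k$NN-LM formulation: at each step, the base LM's next-token distribution is interpolated with a distribution computed from the $k$ nearest entries in an external datastore. The sketch below illustrates that interpolation in NumPy; the function name, datastore layout, and hyperparameter values ($k$, temperature, interpolation weight) are illustrative assumptions, not the released knnlm-limits implementation.

```python
import numpy as np

# Minimal sketch of kNN-LM interpolation (in the spirit of Khandelwal et al., 2020).
# All names and hyperparameters are illustrative placeholders.

def knn_lm_next_token_probs(p_lm, query, datastore_keys, datastore_values,
                            vocab_size, k=8, temperature=1.0, lam=0.25):
    """Interpolate a base LM's next-token distribution with a kNN distribution.

    p_lm             : (vocab_size,) softmax output of the base LM at the current step
    query            : (d,) hidden-state representation of the current context
    datastore_keys   : (N, d) context representations stored for the datastore corpus
    datastore_values : (N,) next-token ids paired with each stored context
    """
    # 1. Retrieve the k nearest stored contexts (exhaustive L2 search here;
    #    real systems use an approximate index such as FAISS).
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nn = np.argsort(dists)[:k]

    # 2. Turn negative distances into a distribution over the retrieved neighbors.
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()

    # 3. Scatter neighbor weights onto the tokens that followed those contexts.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, datastore_values[nn], weights)

    # 4. Interpolate: p(w|x) = lam * p_kNN(w|x) + (1 - lam) * p_LM(w|x).
    return lam * p_knn + (1.0 - lam) * p_lm
```

In practice the datastore holds one (context representation, next token) pair per token of the datastore corpus, so swapping in a higher-quality datastore changes only the retrieval side while the base LM stays fixed.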
Related papers
- VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark [53.091690659399234]
Knowledge editing on large language models (LLMs) has received considerable attention.
The existing LVLM editing benchmark, which comprises three metrics (Reliability, Locality, and Generality), falls short in the quality of synthesized evaluation images.
We employ more reliable data collection methods to construct a new Large $\textbf{V}$ision-$\textbf{L}$anguage Model knowledge editing benchmark, VLKEB.
arXiv Detail & Related papers (2024-03-12T06:16:33Z)
- Learning to Reduce: Optimal Representations of Structured Data in Prompting Large Language Models [42.16047343029512]
Large Language Models (LLMs) have been widely used as general-purpose AI agents.
We propose a framework, Learning to Reduce, that fine-tunes a language model to generate a reduced version of an input context.
We show that our model achieves comparable accuracies in selecting the relevant evidence from an input context.
arXiv Detail & Related papers (2024-02-22T00:41:23Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
- You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
arXiv Detail & Related papers (2022-10-28T02:57:40Z)
- Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data [82.92758444543689]
Retrieval-based methods have been shown to be effective in NLP tasks via introducing external knowledge.
Surprisingly, we found that REtrieving from the traINing datA (REINA) alone can lead to significant gains on multiple NLG and NLU tasks.
Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks.
arXiv Detail & Related papers (2022-03-16T17:37:27Z)
- Regularized Training of Nearest Neighbor Language Models [10.994336081018043]
We build upon $k$NN-LM (Khandelwal et al., 2020), which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results.
We find that the added L2 regularization seems to improve the performance for high-frequency words without deteriorating the performance for low-frequency ones.
arXiv Detail & Related papers (2021-09-16T23:20:24Z)