Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
- URL: http://arxiv.org/abs/2212.09146v3
- Date: Thu, 2 Nov 2023 19:12:52 GMT
- Title: Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model
- Authors: Parishad BehnamGhader, Santiago Miret, Siva Reddy
- Abstract summary: Augmenting pretrained language models with retrievers has shown promise in effectively solving common NLP problems.
We evaluate the strengths and weaknesses of popular retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD, Contriever + ATLAS, and Contriever + Flan-T5.
- Score: 33.729248437727634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Augmenting pretrained language models with retrievers has shown promise in
effectively solving common NLP problems, such as language modeling and question
answering. In this paper, we evaluate the strengths and weaknesses of popular
retriever-augmented language models, namely kNN-LM, REALM, DPR + FiD,
Contriever + ATLAS, and Contriever + Flan-T5, in reasoning over retrieved
statements across different tasks. Our findings indicate that the simple
similarity metric employed by retrievers is insufficient for retrieving all the
necessary statements for reasoning. Additionally, the language models do not
exhibit strong reasoning even when provided with only the required statements.
Furthermore, when combined with imperfect retrievers, the performance of the
language models becomes even worse, e.g., Flan-T5's performance drops by 28.6%
when retrieving 5 statements using Contriever. While larger language models
improve performance, there is still substantial room for enhancement. Our
further analysis indicates that multihop retrieve-and-read is promising for
large language models like GPT-3.5, but does not generalize to other language
models like Flan-T5-xxl.
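As a concrete illustration of the multihop retrieve-and-read setup discussed above, here is a minimal Python sketch: the model is asked to answer from the retrieved statements and, if it cannot, to emit a follow-up query that drives the next retrieval hop. The `retrieve` and `generate` callables, the prompt wording, and the `NEED:` convention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multihop retrieve-and-read. `retrieve` stands in for a
# dense retriever (e.g., Contriever) and `generate` for an LLM call (e.g.,
# GPT-3.5); the NEED: convention is an illustrative assumption.
from typing import Callable

def multihop_retrieve_and_read(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
    max_hops: int = 3,
) -> str:
    evidence: list[str] = []
    query = question
    answer = ""
    for _ in range(max_hops):
        evidence.extend(retrieve(query))
        context = "\n".join(evidence)
        answer = generate(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "Answer, or reply 'NEED: <follow-up query>' if the context is insufficient."
        )
        if not answer.startswith("NEED:"):
            break
        # The model's own follow-up query drives the next retrieval hop.
        query = answer.removeprefix("NEED:").strip()
    return answer
```

Per the abstract, a loop of this shape helps a strong reader like GPT-3.5 but does not generalize to Flan-T5-xxl.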
Related papers
- Robustifying Language Models with Test-Time Adaptation [17.96043752001886]
Large-scale language models have achieved state-of-the-art performance on a number of language tasks.
However, they fail on adversarial language examples: sentences optimized to fool the language models while retaining similar semantic meaning for humans.
We show that we can reverse many language adversarial attacks by adapting the input sentence with predictions from masked words.
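A hedged sketch of that masked-word adaptation idea, using the Hugging Face fill-mask pipeline: each position is masked in turn and substituted when the masked LM strongly prefers a different word, in the hope of undoing an adversarial perturbation. The model choice, word-level masking, and confidence threshold are assumptions for illustration.

```python
# Illustrative test-time adaptation: re-predict each word with a masked LM
# and substitute high-confidence disagreements. Model and threshold are
# assumptions, not the paper's exact setup.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def adapt_sentence(sentence: str, confidence: float = 0.5) -> str:
    words = sentence.split()
    for i in range(len(words)):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        best = fill_mask(masked, top_k=1)[0]
        # Replace only when the masked LM strongly prefers another word.
        if best["score"] >= confidence and best["token_str"] != words[i].lower():
            words[i] = best["token_str"]
    return " ".join(words)

print(adapt_sentence("The flim was absolutely wonderful"))
```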
arXiv Detail & Related papers (2023-10-29T22:37:54Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
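One simple mitigation in that spirit (an illustration, not necessarily this paper's method) is to gate retrieved passages on their relevance score, so that the reader falls back to its parametric knowledge rather than being misled by irrelevant context:

```python
# Illustrative relevance gating: drop retrieved passages scoring below a
# threshold before building the reader's prompt. The threshold and prompt
# format are assumptions.
def build_prompt(question: str,
                 scored_passages: list[tuple[str, float]],
                 min_score: float = 0.7) -> str:
    relevant = [p for p, score in scored_passages if score >= min_score]
    context = "\n".join(relevant) if relevant else "(no relevant context retrieved)"
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```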
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding [13.352795145385645]
Large pretrained language models have demonstrated strong language understanding capabilities.
We evaluate several such models, including ChatGPT and OPT models of different sizes, on multiple benchmarks.
We show, however, that ChatGPT is worse at slot filling and that its performance is sensitive to ASR errors.
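The zero-shot prompts such an evaluation builds for intent detection look roughly like the sketch below; the label set and wording are illustrative assumptions, not the benchmark's actual format.

```python
# Illustrative zero-shot intent-classification prompt; the intent labels
# are assumptions, not an actual benchmark label set.
INTENTS = ["play_music", "set_alarm", "get_weather"]

def intent_prompt(utterance: str) -> str:
    labels = ", ".join(INTENTS)
    return (
        f"Classify the intent of the utterance as one of: {labels}.\n"
        f"Utterance: {utterance}\n"
        "Intent:"
    )

print(intent_prompt("wake me up at seven tomorrow"))
```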
arXiv Detail & Related papers (2023-05-22T21:59:26Z)
- BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer [1.911678487931003]
Retrieval-based language models are increasingly employed in question-answering tasks.
We develop the first Norwegian retrieval-based model by adapting the REALM framework.
We show that this type of training improves the reader's performance on extractive question-answering.
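The extractive question-answering setting evaluated here can be sketched with an off-the-shelf reader, where the answer must be a span of the retrieved passage; the English SQuAD model below is a placeholder for the Norwegian reader the paper trains.

```python
# Illustrative extractive QA: the reader selects an answer span from a
# retrieved passage. The English model is a stand-in for BRENT.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
passage = "Oslo has been the capital of Norway since 1814."
result = qa(question="What is the capital of Norway?", context=passage)
print(result["answer"], result["score"])
```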
arXiv Detail & Related papers (2023-04-19T13:40:47Z)
- REPLUG: Retrieval-Augmented Black-Box Language Models [101.60145719119373]
REPLUG is a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tunable retrieval model.
We show that REPLUG significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.
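REPLUG's black-box ensembling can be sketched as follows: each retrieved document is prepended to the prompt in a separate LM call, and the resulting next-token distributions are mixed with weights given by softmax-normalized retrieval scores. `lm_token_probs` is an illustrative stand-in for the black-box LM; the score scaling is an assumption.

```python
# Sketch of REPLUG-style ensembling over retrieved documents.
import math
from typing import Callable

def replug_next_token_probs(
    prompt: str,
    docs_with_scores: list[tuple[str, float]],
    lm_token_probs: Callable[[str], dict[str, float]],
) -> dict[str, float]:
    # Softmax-normalize retrieval scores into ensemble weights.
    z = sum(math.exp(s) for _, s in docs_with_scores)
    weights = [math.exp(s) / z for _, s in docs_with_scores]

    mixed: dict[str, float] = {}
    for (doc, _), w in zip(docs_with_scores, weights):
        # One black-box LM call per retrieved document.
        for token, p in lm_token_probs(f"{doc}\n\n{prompt}").items():
            mixed[token] = mixed.get(token, 0.0) + w * p
    return mixed
```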
arXiv Detail & Related papers (2023-01-30T04:18:09Z)
- Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks [39.39138995087475]
We ask how much of human-like thinking can be captured by learning statistical patterns in language alone.
Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization.
We find that humans are far more robust than LLMs on this benchmark.
arXiv Detail & Related papers (2022-05-11T18:14:33Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
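The few-shot setup described above reduces to prompt construction: English labeled demonstrations followed by a non-English test sample. The sentiment task, examples, and Indonesian test sentence below are illustrative assumptions.

```python
# Illustrative few-shot prompt: English demonstrations, non-English query.
ENGLISH_EXAMPLES = [
    ("The movie was fantastic.", "positive"),
    ("I hated every minute of it.", "negative"),
]

def few_shot_prompt(test_sentence: str) -> str:
    demos = "\n".join(f"Review: {t}\nSentiment: {l}" for t, l in ENGLISH_EXAMPLES)
    return f"{demos}\nReview: {test_sentence}\nSentiment:"

# An Indonesian test sample, chosen here purely for illustration.
print(few_shot_prompt("Filmnya sangat bagus."))
```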
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
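A minimal PyTorch sketch of such an objective: ordinary sentences receive the usual negative log-likelihood, while negated generics are penalized with -log(1 - p), pushing probability mass away from continuations the negation rules out. Tensor shapes, the mixing weight, and the batch-level switch are assumptions.

```python
# Sketch of a likelihood + unlikelihood objective for negated sentences.
import torch
import torch.nn.functional as F

def negation_aware_loss(logits: torch.Tensor,      # (batch, seq, vocab)
                        targets: torch.Tensor,     # (batch, seq)
                        is_negated: torch.Tensor,  # (batch,) bool
                        alpha: float = 1.0) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    likelihood = -token_lp.mean(dim=-1)                # standard LM loss
    # -log(1 - p): discourage exactly the tokens a negated sentence rules out.
    unlikelihood = -torch.log1p(-token_lp.exp() + 1e-8).mean(dim=-1)
    per_example = torch.where(is_negated, alpha * unlikelihood, likelihood)
    return per_example.mean()
```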
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
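A character language model of the kind compared here can be sketched in a few lines: train bigram counts on whatever target-language words are available and rank correction candidates by smoothed log-probability. The add-alpha smoothing and toy alphabet size are assumptions.

```python
# Toy character-bigram LM for ranking spelling-correction candidates.
import math
from collections import Counter

def train_char_bigrams(corpus: list[str]) -> Counter:
    counts: Counter = Counter()
    for word in corpus:
        padded = f"#{word}#"  # '#' marks word boundaries
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts

def bigram_logprob(word: str, counts: Counter, alpha: float = 1.0) -> float:
    total = sum(counts.values())
    padded = f"#{word}#"
    # Add-alpha smoothing; 26*26 is a toy alphabet-size assumption.
    return sum(
        math.log((counts[padded[i:i + 2]] + alpha) / (total + alpha * 26 * 26))
        for i in range(len(padded) - 1)
    )

counts = train_char_bigrams(["night", "light", "right", "sight"])
print(max(["nigth", "night"], key=lambda w: bigram_logprob(w, counts)))
```

As corrections are confirmed interactively, the confirmed words can simply be appended to the training corpus and the counts retrained.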
arXiv Detail & Related papers (2020-10-20T17:31:07Z)