StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in
Question Answering Models
- URL: http://arxiv.org/abs/2205.11388v1
- Date: Mon, 23 May 2022 15:33:41 GMT
- Title: StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in
Question Answering Models
- Authors: Adam Liška, Tomáš Kočiský, Elena Gribovskaya, Tayfun Terzi, Eren
Sezener, Devang Agrawal, Cyprien de Masson d'Autume, Tim Scholtes, Manzil
Zaheer, Susannah Young, Ellen Gilsenan-McMahon, Sophia Austin, Phil Blunsom,
Angeliki Lazaridou
- Abstract summary: We construct a new large-scale dataset, StreamingQA, with human-written and generated questions asked on a given date.
We evaluate our models quarterly as they read new articles not seen in pre-training.
We show that parametric models can be updated without full retraining, while avoiding catastrophic forgetting.
- Score: 31.43391633383255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge and language understanding of models evaluated through question
answering (QA) has usually been studied on static snapshots of knowledge, like
Wikipedia. However, our world is dynamic, evolves over time, and our models'
knowledge becomes outdated. To study how semi-parametric QA models and their
underlying parametric language models (LMs) adapt to evolving knowledge, we
construct a new large-scale dataset, StreamingQA, with human-written and
generated questions asked on a given date, to be answered from 14 years of
time-stamped news articles. We evaluate our models quarterly as they read new
articles not seen in pre-training. We show that parametric models can be
updated without full retraining, while avoiding catastrophic forgetting. For
semi-parametric models, adding new articles into the search space allows for
rapid adaptation; however, models with an outdated underlying LM underperform
those with a retrained LM. For questions about higher-frequency named entities,
parametric updates are particularly beneficial. In our dynamic world, the
StreamingQA dataset enables a more realistic evaluation of QA models, and our
experiments highlight several promising directions for future research.
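The abstract describes the evaluation protocol only in prose, so the following is a minimal Python sketch of that date-aware loop under stated assumptions: each question carries an asked-on date, a semi-parametric model may only consult articles published up to that date, and predictions are scored with token F1. The field names (`date`, `asked`, `answer`), the toy overlap-based retriever, and the scorer are illustrative stand-ins, not taken from the StreamingQA release, and the quarterly bucketing of evaluation dates is omitted for brevity.

```python
# A minimal sketch (not the authors' code) of date-aware QA evaluation:
# only articles published on or before the question's asked-on date are
# visible to the model. All field names and the scorer are assumptions.
from collections import Counter
from datetime import date

articles = [  # stand-in for 14 years of time-stamped news articles
    {"date": date(2019, 4, 2), "text": "The spacecraft landed safely on Tuesday."},
    {"date": date(2020, 7, 9), "text": "The new vaccine trial began in July 2020."},
]
questions = [  # stand-in for human-written and generated questions
    {"asked": date(2020, 12, 31), "q": "When did the new vaccine trial begin?",
     "answer": "July 2020"},
]

def tokens(s):
    return s.lower().replace("?", "").replace(".", "").split()

def retrieve(question, asked, corpus, k=1):
    """Keep only articles published on or before the question date, then rank
    them by naive token overlap with the question (a stand-in retriever)."""
    visible = [a for a in corpus if a["date"] <= asked]
    return sorted(
        visible,
        key=lambda a: len(set(tokens(question)) & set(tokens(a["text"]))),
        reverse=True,
    )[:k]

def token_f1(pred, gold):
    """Token-level F1, a common QA metric (the benchmark's exact metric may differ)."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

for ex in questions:
    evidence = retrieve(ex["q"], ex["asked"], articles)
    # A real system would feed `evidence` plus the question to a reader/LM here;
    # echoing the top passage keeps the sketch self-contained.
    prediction = evidence[0]["text"] if evidence else ""
    print(round(token_f1(prediction, ex["answer"]), 3))
```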
Related papers
- Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence [3.566250952750758]
We introduce the Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models.
Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts.
The accompanying DIA-Bench dataset is presented in various formats such as text, PDFs, compiled binaries, and visual puzzles.
arXiv Detail & Related papers (2024-10-20T20:07:36Z)
- Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning [2.8972337324168014]
We study how PLMs may learn and remember new world knowledge facts that do not occur in their pre-training corpus.
We first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates.
We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information.
arXiv Detail & Related papers (2024-08-30T07:54:50Z)
- Towards Better Generalization in Open-Domain Question Answering by Mitigating Context Memorization [67.92796510359595]
Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus.
It is still unclear how well an OpenQA model can transfer to completely new knowledge domains.
We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy to mitigate knowledge over-memorization.
arXiv Detail & Related papers (2024-04-02T05:44:50Z)
- Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models [74.81091933317882]
We introduce EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database.
We find that existing continual learning baselines struggle to update and remove outdated knowledge.
Our work aims to model the dynamic nature of real-world information and to support faithful evaluation of how well language models adapt to evolving knowledge.
arXiv Detail & Related papers (2023-11-14T12:12:02Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering [26.34649731975005]
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA).
While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics unreliable for accurately quantifying model performance.
We use both automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness) and 2) whether they produce a response based on the provided knowledge (faithfulness).
arXiv Detail & Related papers (2023-07-31T17:41:00Z)
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions.
We use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model.
We also use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)
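The last entry describes the retrieve-then-generate pattern only at a high level, so here is a hedged, generic Python sketch of combining a non-parametric memory (a searchable text index) with a parametric one (a pre-trained seq2seq LM). The TF-IDF retriever, the toy documents, and the google/flan-t5-small checkpoint are illustrative stand-ins, not the paper's DPR + BART implementation.

```python
# A generic retrieve-then-generate sketch: the document index plays the role of
# non-parametric memory, and a pre-trained seq2seq model supplies the parametric
# memory. Components are stand-ins chosen for brevity, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Non-parametric memory: a tiny document collection indexed with TF-IDF.
docs = [
    "StreamingQA pairs questions asked on a given date with 14 years of news articles.",
    "Retrieval-augmented generation conditions a generator on retrieved passages.",
]
vectorizer = TfidfVectorizer().fit(docs)
doc_matrix = vectorizer.transform(docs)

# Parametric memory: a small pre-trained seq2seq model used as the generator.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

def answer(question: str, k: int = 1) -> str:
    # Retrieve the top-k passages most similar to the question.
    sims = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    context = " ".join(docs[i] for i in sims.argsort()[::-1][:k])
    # Condition generation on the question plus the retrieved evidence.
    prompt = f"question: {question} context: {context}"
    return generator(prompt, max_new_tokens=32)[0]["generated_text"]

print(answer("What does StreamingQA pair questions with?"))
```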