WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset
- URL: http://arxiv.org/abs/2308.03582v2
- Date: Fri, 18 Aug 2023 12:31:52 GMT
- Title: WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset
- Authors: Hsuvas Borkakoty and Luis Espinosa-Anke
- Abstract summary: We propose WikiTiDe, a dataset derived from pairs of timestamped definitions extracted from Wikipedia.
Our results suggest that bootstrapping the seed version of WikiTiDe leads to better fine-tuned models.
- Score: 12.707584479922833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental challenge in the current NLP context, dominated by language
models, comes from the inflexibility of current architectures to 'learn' new
information. While model-centric solutions like continual learning or
parameter-efficient fine-tuning are available, the question still remains of
how to reliably identify changes in language or in the world. In this paper, we
propose WikiTiDe, a dataset derived from pairs of timestamped definitions
extracted from Wikipedia. We argue that such a resource can be helpful for
accelerating diachronic NLP, specifically, for training models able to scan
knowledge resources for core updates concerning a concept, an event, or a named
entity. Our proposed end-to-end method is fully automatic, and leverages a
bootstrapping algorithm for gradually creating a high-quality dataset. Our
results suggest that bootstrapping the seed version of WikiTiDe leads to better
fine-tuned models. We also leverage fine-tuned models in a number of downstream
tasks, showing promising results with respect to competitive baselines.
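The abstract describes the bootstrapping algorithm only at a high level. As a minimal sketch of the general idea, and not the authors' actual pipeline, the snippet below shows a generic self-training loop that grows a seed set of labeled timestamped definition pairs by repeatedly training a classifier and absorbing only high-confidence predictions from an unlabeled pool. All names (DefinitionPair, bootstrap), the TF-IDF/logistic-regression setup, and the 0.9 confidence threshold are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of a bootstrapping (self-training) loop for growing a seed
# dataset of timestamped definition pairs. Names and thresholds are illustrative
# only; the paper's actual pipeline may differ.
from dataclasses import dataclass
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


@dataclass
class DefinitionPair:
    old_def: str      # definition text at the earlier timestamp
    new_def: str      # definition text at the later timestamp
    label: int = -1   # 1 = substantive update, 0 = no real change, -1 = unlabeled


def _texts(pairs: List[DefinitionPair]) -> List[str]:
    # Represent each pair as the concatenation of its two definitions.
    return [p.old_def + " [SEP] " + p.new_def for p in pairs]


def bootstrap(seed: List[DefinitionPair],
              pool: List[DefinitionPair],
              rounds: int = 3,
              threshold: float = 0.9) -> List[DefinitionPair]:
    """Iteratively train on the labeled set and absorb only high-confidence
    predictions from the unlabeled pool. The seed is assumed to contain
    examples of both labels."""
    labeled = list(seed)
    for _ in range(rounds):
        vectorizer = TfidfVectorizer().fit(_texts(labeled + pool))
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vectorizer.transform(_texts(labeled)), [p.label for p in labeled])

        probs = clf.predict_proba(vectorizer.transform(_texts(pool)))
        confident, remaining = [], []
        for pair, prob in zip(pool, probs):
            if prob.max() >= threshold:        # keep only confident predictions
                pair.label = int(clf.classes_[prob.argmax()])
                confident.append(pair)
            else:
                remaining.append(pair)
        labeled.extend(confident)
        pool = remaining
        if not confident:                      # nothing new was confident enough
            break
    return labeled
```

In a loop like this, the confidence threshold is the main design knob: raising it trades coverage of the unlabeled pool for label precision, which is in the spirit of the reported finding that bootstrapping the seed version of WikiTiDe yields better fine-tuned models.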
Related papers
- Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning [2.8972337324168014]
We study how PLMs may learn and remember new world knowledge facts that do not occur in their pre-training corpus.
We first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates.
We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information.
arXiv Detail & Related papers (2024-08-30T07:54:50Z)
- Robust and Scalable Model Editing for Large Language Models [75.95623066605259]
We propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing.
Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs.
arXiv Detail & Related papers (2024-03-26T06:57:23Z)
- Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval [21.262531222066208]
In this paper, we devise four pre-training objectives tailored for information retrieval tasks based on the structured knowledge of Wikipedia.
Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus.
Experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains.
arXiv Detail & Related papers (2023-12-17T09:31:47Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life".
arXiv Detail & Related papers (2023-05-24T11:56:20Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Learning Neural Models for Natural Language Processing in the Face of Distributional Shift [10.990447273771592]
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications.
It builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution both at training and test time.
This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information.
It is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime.
arXiv Detail & Related papers (2021-09-03T14:29:20Z)
- WikiCheck: An end-to-end open source Automatic Fact-Checking API based on Wikipedia [1.14219428942199]
We review state-of-the-art datasets and solutions for automatic fact-checking.
We propose a data filtering method that improves the model's performance and generalization.
We present a new fact-checking system, the WikiCheck API, which automatically performs fact validation based on the Wikipedia knowledge base.
arXiv Detail & Related papers (2021-09-02T10:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.