Mining Knowledge for Natural Language Inference from Wikipedia Categories
- URL: http://arxiv.org/abs/2010.01239v1
- Date: Sat, 3 Oct 2020 00:45:01 GMT
- Title: Mining Knowledge for Natural Language Inference from Wikipedia Categories
- Authors: Mingda Chen, Zewei Chu, Karl Stratos, Kevin Gimpel
- Abstract summary: We introduce WikiNLI: a resource for improving model performance on NLI and LE tasks.
It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia.
We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models to downstream tasks.
- Score: 53.26072815839198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate lexical entailment (LE) and natural language inference (NLI) often
require large quantities of costly annotations. To alleviate the need for
labeled data, we introduce WikiNLI: a resource for improving model performance
on NLI and LE tasks. It contains 428,899 pairs of phrases constructed from
naturally annotated category hierarchies in Wikipedia. We show that we can
improve strong baselines such as BERT and RoBERTa by pretraining them on
WikiNLI and transferring the models to downstream tasks. We conduct systematic
comparisons with phrases extracted from other knowledge bases such as WordNet
and Wikidata to find that pretraining on WikiNLI gives the best performance. In
addition, we construct WikiNLI in other languages, and show that pretraining on
them improves performance on NLI tasks in the corresponding languages.
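As a rough illustration of the kind of supervision a category hierarchy provides, the sketch below pairs a subcategory phrase with a parent category phrase as an entailment-like example and samples a non-ancestor category as a rough negative. The toy hierarchy, label names, and sampling scheme are assumptions for illustration, not the exact WikiNLI construction procedure described in the paper.

```python
import random

# Toy stand-in for a fragment of Wikipedia's category hierarchy
# (child category -> parent categories). WikiNLI itself is mined
# from the full, naturally annotated category graph.
CATEGORY_PARENTS = {
    "Italian opera singers": ["Opera singers", "Italian singers"],
    "Opera singers": ["Singers"],
    "Italian singers": ["Singers"],
    "Singers": [],
    "Italian painters": ["Painters"],
    "Painters": [],
}


def ancestors(cat, hierarchy):
    """Return all transitive parent categories of `cat`."""
    seen, stack = set(), list(hierarchy.get(cat, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(hierarchy.get(parent, []))
    return seen


def make_pairs(hierarchy, seed=0):
    """Build (premise, hypothesis, label) triples from the hierarchy.

    A child category paired with a parent acts as an entailment-like
    example; pairing it with a non-ancestor acts as a rough negative.
    The label names and negative sampling here are illustrative
    assumptions, not the paper's exact label scheme.
    """
    rng = random.Random(seed)
    all_cats = list(hierarchy)
    pairs = []
    for child, parents in hierarchy.items():
        ancs = ancestors(child, hierarchy)
        for parent in parents:
            pairs.append((child, parent, "entailment"))
        candidates = [c for c in all_cats if c != child and c not in ancs]
        if parents and candidates:
            pairs.append((child, rng.choice(candidates), "neutral"))
    return pairs


if __name__ == "__main__":
    for premise, hypothesis, label in make_pairs(CATEGORY_PARENTS):
        print(f"{premise}\t{hypothesis}\t{label}")
```

Pairs produced this way could then serve as intermediate pretraining data for a sequence-pair classifier (e.g., a BERT-style model) before fine-tuning on an NLI or LE benchmark, which is the transfer setup the abstract describes.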
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data exhibits improved performance for monolingual STS.
We find that the Wikipedia domain outperforms the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting [22.743097175747575]
We introduce ProMap, a novel approach for bilingual lexicon induction (BLI).
ProMap relies on effective padded prompting of language models with a seed dictionary and achieves good performance when used independently.
When evaluated on both rich-resource and low-resource languages, ProMap consistently achieves state-of-the-art results.
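The blurb above only hints at how the prompting works, so here is a heavily hedged sketch of the general idea of few-shot prompting with a seed dictionary for BLI; the prompt template, language pair, and seed entries are hypothetical and do not reproduce ProMap's actual padded prompt design.

```python
def build_bli_prompt(seed_dict, source_word, src_lang="English", tgt_lang="French"):
    """Build a few-shot translation prompt from a seed dictionary.

    The template, language pair, and seed entries are illustrative
    assumptions; ProMap's padded prompt format differs and is
    described in the paper.
    """
    lines = [f"Translate {src_lang} words into {tgt_lang}."]
    for src, tgt in seed_dict.items():
        lines.append(f"{src_lang}: {src} -> {tgt_lang}: {tgt}")
    # The unanswered final line is what the language model is asked to complete.
    lines.append(f"{src_lang}: {source_word} -> {tgt_lang}:")
    return "\n".join(lines)


if __name__ == "__main__":
    seed = {"dog": "chien", "house": "maison", "water": "eau"}
    print(build_bli_prompt(seed, "bread"))
    # The resulting prompt would be sent to a language model, and the generated
    # completion parsed as the induced translation (hypothetical usage).
```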
arXiv Detail & Related papers (2023-10-28T18:33:24Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes only the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
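A minimal PyTorch sketch of the freezing recipe described above, assuming a toy stand-in model: the original parameters are frozen and only a newly added language-specific embedding table receives gradient updates. Module names, sizes, and the loss are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
OLD_VOCAB, NEW_VOCAB, D_MODEL = 32000, 8000, 512


class TinyNMT(nn.Module):
    """Stand-in for a pretrained multilingual NMT model."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(OLD_VOCAB, D_MODEL)
        self.encoder = nn.Linear(D_MODEL, D_MODEL)  # placeholder for the real encoder/decoder

    def forward(self, token_ids, embed=None):
        embed = embed if embed is not None else self.embed
        return self.encoder(embed(token_ids))


model = TinyNMT()  # pretend this is the pretrained multilingual model

# Freeze every original parameter so behavior on the initial languages is untouched.
for p in model.parameters():
    p.requires_grad = False

# New, trainable language-specific embedding table for the added language.
new_embed = nn.Embedding(NEW_VOCAB, D_MODEL)
optimizer = torch.optim.Adam(new_embed.parameters(), lr=1e-4)

# One illustrative training step on fake "parallel data"; a real setup would
# use the NMT loss on actual sentence pairs for the new language.
token_ids = torch.randint(0, NEW_VOCAB, (4, 16))
target = torch.zeros(4, 16, D_MODEL)
loss = nn.functional.mse_loss(model(token_ids, embed=new_embed), target)
loss.backward()
optimizer.step()
```

Because only `new_embed` is passed to the optimizer, the frozen original parameters, and hence the initial languages' translation quality, stay unchanged.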
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) has been formulated as a unified framework for solving various NLP problems.
This work presents DocNLI, a newly constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- OCNLI: Original Chinese Natural Language Inference [21.540733910984006]
We present the first large-scale NLI dataset for Chinese (consisting of 56,000 annotated sentence pairs), called the Original Chinese Natural Language Inference dataset (OCNLI).
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
arXiv Detail & Related papers (2020-10-12T04:25:48Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.