SciNLI: A Corpus for Natural Language Inference on Scientific Text
- URL: http://arxiv.org/abs/2203.06728v2
- Date: Tue, 15 Mar 2022 02:27:08 GMT
- Authors: Mobashir Sadat and Cornelia Caragea
- Abstract summary: We introduce SciNLI, a large dataset for NLI that captures the formality in scientific text.
Our best performing model with XLNet achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Natural Language Inference (NLI) datasets, while being instrumental
in the advancement of Natural Language Understanding (NLU) research, are not
related to scientific text. In this paper, we introduce SciNLI, a large dataset
for NLI that captures the formality in scientific text and contains 107,412
sentence pairs extracted from scholarly papers on NLP and computational
linguistics. Given that the text used in scientific literature differs vastly
from the text used in everyday language both in terms of vocabulary and
sentence structure, our dataset is well suited to serve as a benchmark for the
evaluation of scientific NLU models. Our experiments show that SciNLI is harder
to classify than the existing NLI datasets. Our best performing model with
XLNet achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%,
showing that there is substantial room for improvement.
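To make the task format concrete: an NLI dataset consists of (premise, hypothesis, label) triples, and SciNLI extracts its sentence pairs from scholarly papers. The sketch below is a hypothetical illustration of how adjacent sentences linked by a discourse cue might be harvested into labeled pairs; the linking phrases, label names, and example sentences are illustrative assumptions, not the actual SciNLI extraction procedure or label set.

```python
# Hypothetical sketch of harvesting (premise, hypothesis, label) triples
# from adjacent sentences. The linker-to-label mapping below is an
# illustrative assumption, not SciNLI's actual scheme.

LINKERS = {
    # linking phrase opening the second sentence -> assumed relation
    "therefore": "reasoning",
    "however": "contrasting",
    "in other words": "entailment",
}

def harvest_pair(sent1: str, sent2: str):
    """Return (premise, hypothesis, label) if sent2 opens with a known
    linking phrase, else None. The cue is stripped from the hypothesis
    so the model cannot classify the pair from the cue alone."""
    lowered = sent2.lower()
    for phrase, label in LINKERS.items():
        if lowered.startswith(phrase):
            hypothesis = sent2[len(phrase):].lstrip(" ,")
            return (sent1, hypothesis, label)
    return None

pair = harvest_pair(
    "Our model is trained only on in-domain data.",
    "Therefore, it may not generalize to other scientific fields.",
)
print(pair)
```

Stripping the linking phrase is the key design choice in this style of distant supervision: if the cue remained in the hypothesis, a classifier could exploit it as a shortcut instead of modeling the semantic relation.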
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
Until now, no NLI corpus has been publicly available for Romanian.
We introduce the first Romanian NLI corpus (RoNLI), comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs).
We show that domain shift degrades the performance of scientific NLI models, which demonstrates the diverse characteristics of the domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results indicate that NLI fine-tuning improves model performance on both tasks and in both languages, for monolingual and multilingual models alike.
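The objective family typically used when fine-tuning sentence encoders on NLI data treats entailment pairs as positives and contrasts them against other candidates. The sketch below computes an InfoNCE-style contrastive loss over toy embeddings in pure Python; the vectors, temperature value, and function names are illustrative assumptions, not details from the paper.

```python
import math

# Minimal sketch of an InfoNCE-style contrastive loss over sentence
# embeddings: the anchor should be closer (in cosine similarity) to its
# positive than to the negatives. All values here are toy assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def info_nce(anchor, positive, negatives, temperature=0.05):
    """Cross-entropy of the positive against all candidates, with
    cosine similarities scaled by a temperature before the softmax."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(x - m) for x in logits)
    return -(logits[0] - m - math.log(denom))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # near-duplicate meaning -> loss near zero
negative = [0.0, 1.0]   # unrelated meaning
print(info_nce(anchor, positive, [negative]))
```

Minimizing this loss pulls entailment pairs together and pushes unrelated sentences apart in embedding space, which is why such fine-tuning can transfer to retrieval and ranking.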
arXiv Detail & Related papers (2023-08-06T12:40:58Z) - CSL: A Large-scale Chinese Scientific Literature Dataset [30.606855209042603]
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers.
To our knowledge, CSL is the first scientific document dataset in Chinese. Its semi-structured fields also serve as natural annotations that can support many supervised NLP tasks.
We present a benchmark to evaluate model performance across scientific-domain tasks, namely summarization, keyword generation, and text classification.
arXiv Detail & Related papers (2022-09-12T06:10:47Z) - DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z) - OCNLI: Original Chinese Natural Language Inference [21.540733910984006]
We present the first large-scale NLI dataset (consisting of 56,000 annotated sentence pairs) for Chinese, called the Original Chinese Natural Language Inference dataset (OCNLI).
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
arXiv Detail & Related papers (2020-10-12T04:25:48Z) - FarsTail: A Persian Natural Language Inference Dataset [1.3048920509133808]
Natural language inference (NLI) is one of the central tasks in natural language processing (NLP).
We present a new dataset for the NLI task in the Persian language, also known as Farsi.
This dataset, named FarsTail, includes 10,367 samples which are provided in both the Persian language and the indexed format.
arXiv Detail & Related papers (2020-09-18T13:04:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.