OCNLI: Original Chinese Natural Language Inference
- URL: http://arxiv.org/abs/2010.05444v1
- Date: Mon, 12 Oct 2020 04:25:48 GMT
- Title: OCNLI: Original Chinese Natural Language Inference
- Authors: Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler, Lawrence S. Moss
- Abstract summary: We present the first large-scale NLI dataset (consisting of 56,000 annotated sentence pairs) for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI).
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
- Score: 21.540733910984006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the tremendous recent progress on natural language inference (NLI),
driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and
advances in modeling, most progress has been limited to English due to a lack
of reliable datasets for most of the world's languages. In this paper, we
present the first large-scale NLI dataset (consisting of ~56,000 annotated
sentence pairs) for Chinese called the Original Chinese Natural Language
Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other
languages, our dataset does not rely on any automatic translation or non-expert
annotation. Instead, we elicit annotations from native speakers specializing in
linguistics. We follow closely the annotation protocol used for MNLI, but
create new strategies for eliciting diverse hypotheses. We establish several
baseline results on our dataset using state-of-the-art pre-trained models for
Chinese, and find even the best performing models to be far outpaced by human
performance (~12% absolute performance gap), making it a challenging new
resource that we hope will help to accelerate progress in Chinese NLU. To the
best of our knowledge, this is the first human-elicited MNLI-style corpus for a
non-English language.
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data exhibits improved performance for monolingual STS.
We find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results point to the fact that NLI fine-tuning increases the performance of the models in both tasks and both languages, with the potential to improve mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- IndicXNLI: Evaluating Multilingual Inference for Indian Languages [9.838755823660147]
IndicXNLI is an NLI dataset for 11 Indic languages.
By fine-tuning different pre-trained LMs on IndicXNLI, we analyze various cross-lingual transfer techniques.
These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.
arXiv Detail & Related papers (2022-04-19T09:49:00Z)
- WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z)
- Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z)