KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language
Understanding
- URL: http://arxiv.org/abs/2004.03289v3
- Date: Mon, 5 Oct 2020 09:28:51 GMT
- Authors: Jiyeon Ham, Yo Joong Choe, Kyubyong Park, Ilji Choi, Hyungjoon Soh
- Abstract summary: Natural language inference (NLI) and semantic textual similarity (STS) are key tasks in natural language understanding (NLU).
There are no publicly available NLI or STS datasets in the Korean language.
We construct and release new datasets for Korean NLI and STS, dubbed KorNLI and KorSTS, respectively.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Natural language inference (NLI) and semantic textual similarity (STS) are
key tasks in natural language understanding (NLU). Although several benchmark
datasets for those tasks have been released in English and a few other
languages, there are no publicly available NLI or STS datasets in the Korean
language. Motivated by this, we construct and release new datasets for Korean
NLI and STS, dubbed KorNLI and KorSTS, respectively. Following previous
approaches, we machine-translate existing English training sets and manually
translate development and test sets into Korean. To accelerate research on
Korean NLU, we also establish baselines on KorNLI and KorSTS. Our datasets are
publicly available at https://github.com/kakaobrain/KorNLUDatasets.
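STS benchmarks such as KorSTS are conventionally scored with Spearman rank correlation between a model's similarity scores and the 0-5 gold labels. A minimal, dependency-free sketch of that metric follows; the toy gold/predicted values are illustrative only, not real KorSTS examples.

```python
# Sketch: Spearman rank correlation, the standard evaluation metric for
# STS benchmarks (pure stdlib; toy data, not drawn from KorSTS).
from statistics import mean

def ranks(values):
    """Return 1-based ranks, averaging tied positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

gold = [0.0, 1.5, 2.5, 4.0, 5.0]       # gold similarities on the 0-5 STS scale
pred = [0.10, 0.35, 0.40, 0.80, 0.95]  # hypothetical model similarity scores
print(round(spearman(gold, pred), 3))  # → 1.0 (identical ranking)
```

Because Spearman only compares rankings, a model's raw scores need not live on the 0-5 scale; cosine similarities in [0, 1], as above, are scored identically to any monotone rescaling of them.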
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Compositional Evaluation on Japanese Textual Entailment and Similarity [20.864082353441685]
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models.
Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English.
No multilingual NLI/STS datasets are available in Japanese, which is typologically different from English.
arXiv Detail & Related papers (2022-08-09T15:10:56Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce the Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- OCNLI: Original Chinese Natural Language Inference [21.540733910984006]
We present the first large-scale NLI dataset (consisting of 56,000 annotated sentence pairs) for Chinese, called the Original Chinese Natural Language Inference dataset (OCNLI).
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
arXiv Detail & Related papers (2020-10-12T04:25:48Z)
- Mining Knowledge for Natural Language Inference from Wikipedia Categories [53.26072815839198]
We introduce WikiNLI: a resource for improving model performance on NLI and lexical entailment (LE) tasks.
It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia.
We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models on downstream tasks.
arXiv Detail & Related papers (2020-10-03T00:45:01Z)
- Data and Representation for Turkish Natural Language Inference [6.135815931215188]
We address natural language inference (NLI) in Turkish.
We translate two large English NLI datasets into Turkish and have a team of experts validate their translation quality and fidelity to the original labels.
We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large.
arXiv Detail & Related papers (2020-04-30T17:12:52Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.