FarsTail: A Persian Natural Language Inference Dataset
- URL: http://arxiv.org/abs/2009.08820v2
- Date: Thu, 8 Jul 2021 15:21:54 GMT
- Title: FarsTail: A Persian Natural Language Inference Dataset
- Authors: Hossein Amirkhani, Mohammad AzariJafari, Zohreh Pourjafari, Soroush
Faridan-Jahromi, Zeinab Kouhkan, Azadeh Amirak
- Abstract summary: Natural language inference (NLI) is one of the central tasks in natural language processing (NLP).
We present a new dataset for the NLI task in the Persian language, also known as Farsi.
This dataset, named FarsTail, includes 10,367 samples, provided both in Persian and in an indexed format.
- Score: 1.3048920509133808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language inference (NLI) is one of the central tasks in natural language processing (NLP) and encapsulates many fundamental aspects of language understanding. With the considerable achievements of data-hungry deep learning methods on NLP tasks, a great deal of effort has been devoted to developing more diverse datasets for different languages. In this paper, we present a new dataset for the NLI task in Persian, also known as Farsi, which is one of the dominant languages of the Middle East. This dataset, named FarsTail, includes 10,367 samples provided both in Persian and in an indexed format so that it is useful for non-Persian researchers as well. The samples are generated from 3,539 multiple-choice questions with minimal annotator intervention, in a manner similar to the SciTail dataset, and a carefully designed multi-step process is adopted to ensure the quality of the dataset. We also report the results of traditional and state-of-the-art methods on FarsTail, including different embedding methods such as word2vec, fastText, ELMo, BERT, and LASER, as well as different modeling approaches such as DecompAtt, ESIM, HBMP, and ULMFiT, to provide solid baselines for future research. The best test accuracy obtained is 83.38%, which shows that there is ample room for improving current methods before they are useful for real-world NLP applications in different languages. We also investigate the extent to which the models exploit superficial clues, also known as dataset biases, in FarsTail, and partition the test set into easy and hard subsets according to the success of biased models. The dataset is available at https://github.com/dml-qom/FarsTail
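For orientation, here is a minimal sketch of loading the FarsTail splits from the GitHub repository. The file names (Train-word.csv, Val-word.csv, Test-word.csv), the tab separator, and the column layout are assumptions about how the release is organized, so check the repository's README before relying on them.

```python
# Minimal sketch: loading FarsTail splits with pandas.
# ASSUMPTIONS: file names, tab separator, and column layout may differ
# from the actual repository; adjust after inspecting the files.
import pandas as pd

def load_split(path: str) -> pd.DataFrame:
    # Files are assumed tab-separated with premise, hypothesis, and
    # label columns (label coding e / n / c is also an assumption).
    df = pd.read_csv(path, sep="\t")
    df.columns = ["premise", "hypothesis", "label"]  # normalize header names
    return df

train = load_split("FarsTail/data/Train-word.csv")
test = load_split("FarsTail/data/Test-word.csv")
print(len(train), len(test))          # split sizes
print(train["label"].value_counts())  # class balance across the three labels
```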
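The BERT baseline mentioned in the abstract can be approximated with a standard sequence-pair classifier. The sketch below fine-tunes a multilingual encoder on premise-hypothesis pairs using Hugging Face transformers; the checkpoint, hyperparameters, and label coding are illustrative assumptions rather than the authors' exact setup, and it reuses the `train` frame from the loading sketch above.

```python
# Hedged sketch of a BERT-style NLI baseline (not the paper's exact setup).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-multilingual-cased"  # assumption: any Persian-capable encoder works
LABELS = {"e": 0, "n": 1, "c": 2}       # assumed entailment/neutral/contradiction codes

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

class FarsTailDataset(torch.utils.data.Dataset):
    def __init__(self, df):
        # Encode each premise-hypothesis pair as a single [SEP]-joined sequence.
        self.enc = tokenizer(df["premise"].tolist(), df["hypothesis"].tolist(),
                             truncation=True, padding=True)
        self.labels = [LABELS[l] for l in df["label"]]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="farstail-bert", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=FarsTailDataset(train)).train()
```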
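The easy/hard partition described in the abstract follows a common recipe: train a biased model that sees only superficial cues, then split the test set by whether it succeeds. The hypothesis-only probe below (TF-IDF features with logistic regression) is an illustrative stand-in for the paper's biased models, again reusing the `train` and `test` frames loaded above.

```python
# Hedged sketch of a hypothesis-only bias probe for the easy/hard split.
# The probe never sees the premise, so anything it gets right is likely
# driven by dataset biases rather than genuine inference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vec = TfidfVectorizer()
X_train = vec.fit_transform(train["hypothesis"])           # hypotheses only
probe = LogisticRegression(max_iter=1000).fit(X_train, train["label"])

pred = probe.predict(vec.transform(test["hypothesis"]))
easy = test[pred == test["label"]]  # biased probe succeeds -> "easy" subset
hard = test[pred != test["label"]]  # biased probe fails    -> "hard" subset
print(f"easy: {len(easy)}  hard: {len(hard)}")
```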
Related papers
- A New Method for Cross-Lingual-based Semantic Role Labeling [5.992526851963307]
A deep learning algorithm is proposed for semantic role labeling in English and Persian.
The results show significant improvements compared to Niksirt et al.'s model.
The development of cross-lingual methods for semantic role labeling holds promise.
arXiv Detail & Related papers (2024-08-28T16:06:12Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Data-Augmentation-Based Dialectal Adaptation for LLMs [26.72394783468532]
This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024.
The task focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects.
We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance.
arXiv Detail & Related papers (2024-04-11T19:15:32Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Improving Natural Language Inference in Arabic using Transformer Models
and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP).
To overcome the scarcity of suitable Arabic NLI data, we create a dedicated dataset from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost transfer learning method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - LaoPLM: Pre-trained Language Models for Lao [3.2146309563776416]
Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PLMs have been widely used in most NLP applications, they are under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scarce situation of the Lao language.
We present the first transformer-based PLMs for Lao in four versions: BERT-small, BERT-base, ELECTRA-small, and ELECTRA-base, and evaluate them on two downstream tasks: part-of-speech tagging and text classification.
arXiv Detail & Related papers (2021-10-12T11:13:07Z) - DocNLI: A Large-scale Dataset for Document-level Natural Language
Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)