Is this sentence valid? An Arabic Dataset for Commonsense Validation
- URL: http://arxiv.org/abs/2008.10873v1
- Date: Tue, 25 Aug 2020 08:15:55 GMT
- Title: Is this sentence valid? An Arabic Dataset for Commonsense Validation
- Authors: Saja Tawalbeh and Mohammad AL-Smadi
- Abstract summary: This dataset is considered the first in the field of Arabic text commonsense validation.
The dataset is distributed under the Creative Commons BY-SA 4.0 license and can be found on GitHub.
- Score: 0.456877715768796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commonsense understanding and validation remain a challenging task in
the field of natural language understanding. Several research papers have
therefore studied the ability of proposed models to validate commonsense in
text. In this paper, we present a benchmark Arabic dataset for commonsense
understanding and validation, as well as baseline research and models trained
on the same dataset. To the best of our knowledge, this dataset is considered
the first in the field of Arabic text commonsense validation. The dataset is
distributed under the Creative Commons BY-SA 4.0 license and can be found on
GitHub.
Related papers
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP).
To overcome this limitation, we create a dedicated data set from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking [55.75590135151682]
CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
arXiv Detail & Related papers (2022-06-06T09:11:03Z)
- ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD [0.0]
This paper presents our work to fine-tune BERT models for Arabic Word Sense Disambiguation (WSD).
We constructed a dataset of labeled Arabic context-gloss pairs.
Each pair was labeled as True or False and target words in each context were identified and annotated.
arXiv Detail & Related papers (2022-05-19T16:47:18Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Claim Matching Beyond English to Scale Global Fact-Checking [5.836354423653351]
We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims.
Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages.
We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages.
arXiv Detail & Related papers (2021-06-01T23:28:05Z)
- A Benchmark Arabic Dataset for Commonsense Explanation [0.6091702876917281]
This paper presents a benchmark Arabic dataset for commonsense explanation.
The dataset consists of Arabic sentences that do not make sense, each paired with three choices from which to select the one that explains why the sentence is false.
arXiv Detail & Related papers (2020-12-18T14:07:10Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.