ArBanking77: Intent Detection Neural Model and a New Dataset in Modern
and Dialectical Arabic
- URL: http://arxiv.org/abs/2310.19034v1
- Date: Sun, 29 Oct 2023 14:46:11 GMT
- Authors: Mustafa Jarrar, Ahmet Birim, Mohammed Khalilia, Mustafa Erden, Sana
Ghanem
- Abstract summary: This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain.
Our dataset was arabized and localized from the original English Banking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect.
We present a neural model, based on AraBERT, fine-tuned on ArBanking77, which achieved F1-scores of 0.9209 and 0.8995 on MSA and the Palestinian dialect, respectively.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents ArBanking77, a large Arabic dataset for intent
detection in the banking domain. Our dataset was arabized and localized from
the original English Banking77 dataset (13,083 queries) into the ArBanking77
dataset, which contains 31,404 queries in both Modern Standard Arabic (MSA)
and Palestinian dialect, with each query classified into one of the 77 classes
(intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned
on ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and
Palestinian dialect, respectively. We performed extensive experimentation in
which we simulated low-resource settings, where the model is trained on a
subset of the data and augmented with noisy queries to simulate colloquial
terms, mistakes and misspellings found in real NLP systems, especially live
chat queries. The data and the models are publicly available at
https://sina.birzeit.edu/arbanking77.
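The low-resource experiments described above augment training data with noisy queries that mimic colloquial terms, mistakes, and misspellings seen in live chat. A minimal sketch of such character-level noise injection is shown below; the specific operations and rates are illustrative assumptions, since the abstract does not specify the exact noise procedure used:

```python
import random

def add_noise(query: str, noise_rate: float = 0.1, seed: int = 0) -> str:
    """Randomly delete, duplicate, or swap adjacent characters to
    simulate misspellings in chat queries (hypothetical noise model)."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    chars = list(query)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < noise_rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1          # drop this character
                continue
            if op == "duplicate":
                out.append(chars[i])  # emit it twice
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose with the next char
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)
```

With `noise_rate=0.0` the query passes through unchanged, which makes it easy to mix clean and noisy copies of the same training set.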
Related papers
- AlcLaM: Arabic Dialectal Language Model
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z) - AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
We present three desiderata for a good benchmark for language models.
The resulting benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - Natural Language Processing for Dialects of a Language: A Survey
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Arabic Sentiment Analysis with Noisy Deep Explainable Model
This paper proposes an explainable sentiment classification framework for the Arabic language.
The proposed framework can explain specific predictions by training a local surrogate explainable model.
We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - A Parameter-Efficient Learning Approach to Arabic Dialect Identification
with Pre-Trained General-Purpose Speech Model
We develop a token-level label mapping to condition the GSM for Arabic Dialect Identification (ADI).
We achieve new state-of-the-art accuracy on the ADI-17 dataset by vanilla fine-tuning.
Our study demonstrates how to identify Arabic dialects using a small dataset and limited resources, with open-source code and pre-trained models.
arXiv Detail & Related papers (2023-05-18T18:15:53Z) - A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese
Arabic Sentiment Datasets
Two new publicly available datasets are introduced, the 2-Class Sudanese Sentiment dataset and the 3-Class Sudanese Sentiment dataset.
A CNN architecture, SCM, is proposed, comprising five CNN layers together with a novel pooling layer, MMA, to extract the best features.
The proposed model is applied to the existing Saudi Sentiment dataset and to the MSA Hotel Arabic Review dataset, with accuracies of 85.55% and 90.01%, respectively.
arXiv Detail & Related papers (2022-01-29T21:33:28Z) - Interpreting Arabic Transformer Models
We probe how linguistic information is encoded in Arabic pretrained models, trained on different varieties of Arabic language.
We perform a layer and neuron analysis on the models using three intrinsic tasks: two morphological tagging tasks based on MSA (Modern Standard Arabic) and dialectal POS-tagging, and a dialect identification task.
arXiv Detail & Related papers (2022-01-19T06:32:25Z) - The Inception Team at NSURL-2019 Task 8: Semantic Question Similarity in
Arabic
This paper describes our method for the task of Semantic Question Similarity in Arabic.
The aim is to build a model that is able to detect similar semantic questions in the Arabic language for the provided dataset.
arXiv Detail & Related papers (2020-04-24T19:52:40Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.