USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature
Engineering Strategies for Arabic Dialect Identification
- URL: http://arxiv.org/abs/2312.10536v1
- Date: Sat, 16 Dec 2023 20:23:53 GMT
- Title: USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature
Engineering Strategies for Arabic Dialect Identification
- Authors: Mohamed Lichouri, Khaled Lounnas, Aicha Zitouni, Houda Latrache,
Rachida Djeradi
- Abstract summary: We investigate the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features.
During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we conduct an in-depth analysis of several key factors
influencing the performance of Arabic Dialect Identification in the NADI'2023 shared task, with a
specific focus on the first subtask involving country-level dialect
identification. Our investigation encompasses the effects of surface
preprocessing, morphological preprocessing, FastText vector model, and the
weighted concatenation of TF-IDF features. For classification purposes, we
employ the Linear Support Vector Classification (LSVC) model. During the
evaluation phase, our system demonstrates noteworthy results, achieving an F1
score of 62.51%. This result closely aligns with the average F1 score of 72.91%
attained by the other systems submitted for the first subtask.
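A minimal sketch of the described pipeline, weighted concatenation of TF-IDF feature blocks feeding a Linear Support Vector Classification model, assuming scikit-learn and illustrative n-gram ranges and weights (none of which are specified in the abstract):

```python
# Sketch only: weighted concatenation of TF-IDF blocks + LinearSVC.
# N-gram ranges, weights, and any preprocessing choices are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC

def scaled(weight):
    # Multiplies a (sparse) feature block by a scalar weight before concatenation.
    return FunctionTransformer(lambda X: X * weight, accept_sparse=True)

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("word_tfidf", Pipeline([
            ("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
            ("w", scaled(0.7)),   # hypothetical weight
        ])),
        ("char_tfidf", Pipeline([
            ("vec", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
            ("w", scaled(0.3)),   # hypothetical weight
        ])),
    ])),
    ("clf", LinearSVC(C=1.0)),
])

# pipeline.fit(train_texts, train_labels)
# dialect_predictions = pipeline.predict(test_texts)
```

The surface and morphological preprocessing studied in the paper would be applied to the texts before vectorization; that step is omitted from this sketch.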
Related papers
- dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features [0.0]
This paper presents the contribution of our dzNLP team to the NADI 2024 shared task.
Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision.
While our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement.
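As a rough illustration of a multi-classifier ensemble with weighted voting over TF-IDF features; the base classifiers and weights below are assumptions, not the team's reported configuration:

```python
# Illustrative weighted hard-voting ensemble over shared TF-IDF features.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ensemble = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("vote", VotingClassifier(
        estimators=[
            ("svc", LinearSVC()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="hard",        # majority vote over predicted labels
        weights=[2, 1, 1],    # hypothetical per-classifier weights
    )),
])

# ensemble.fit(train_texts, train_labels)
```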
arXiv Detail & Related papers (2024-07-18T15:47:42Z)
- Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023)
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores are well correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
arXiv Detail & Related papers (2024-06-06T09:18:42Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper is the organizers' report on the first competition of argumentation analysis systems dealing with Russian-language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won the first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models on the given task, achieving the best validation accuracy of 89.57% and an F1-score of 0.8875.
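A rough PyTorch sketch of such a hybrid, assuming max-pooled CNN features concatenated with attention-weighted BiLSTM states; layer sizes and the attention form are illustrative, not the paper's exact architecture:

```python
# Illustrative CNN + BiLSTM-with-attention hybrid for text classification.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128, num_classes=7):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)            # scores each BiLSTM state
        self.classifier = nn.Linear(hidden + 2 * hidden, num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        emb = self.embedding(token_ids)                  # (batch, seq, emb_dim)

        # CNN branch: convolution over time, then max-pooling.
        conv_out = torch.relu(self.conv(emb.transpose(1, 2)))  # (batch, hidden, seq)
        conv_feat = conv_out.max(dim=2).values                 # (batch, hidden)

        # BiLSTM branch: attention-weighted sum of the hidden states.
        lstm_out, _ = self.bilstm(emb)                          # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(lstm_out), dim=1)     # (batch, seq, 1)
        lstm_feat = (weights * lstm_out).sum(dim=1)             # (batch, 2*hidden)

        # Combine the intermediate representations of both branches.
        return self.classifier(torch.cat([conv_feat, lstm_feat], dim=1))
```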
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task [83.43738174234053]
We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features.
Our best configuration achieved a micro-averaged accuracy of 0.66 on the 149 test languages.
arXiv Detail & Related papers (2020-10-12T19:25:43Z)
- OSACT4 Shared Task on Offensive Language Detection: Intensive Preprocessing-Based Approach [0.0]
This study aims at investigating the impact of the preprocessing phase on text classification for Arabic text.
The Arabic language used in social media is informal and written using Arabic dialects, which makes the text classification task very complex.
The intensive preprocessing-based approach is shown to have a significant impact on offensive language detection and hate speech detection.
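The summary does not enumerate the preprocessing steps; the sketch below shows the kind of normalization commonly applied to dialectal Arabic social-media text, and is only an assumption about what an intensive pipeline might include:

```python
# Illustrative Arabic social-media text normalization (common practice,
# not necessarily the exact pipeline used in the paper).
import re

TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652]")   # Arabic diacritics
TATWEEL = "\u0640"                                        # elongation character

def normalize_arabic(text: str) -> str:
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)      # URLs, mentions, hash signs
    text = TASHKEEL.sub("", text)                          # strip diacritics
    text = text.replace(TATWEEL, "")                       # strip tatweel
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # normalize alef variants
    text = text.replace("\u0629", "\u0647")                # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> ya
    return re.sub(r"\s+", " ", text).strip()
```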
arXiv Detail & Related papers (2020-05-14T23:46:10Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves an F1 score of 91.51% on the English Sub-task A, which is comparable to the first-place result.
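A minimal sketch of a shared BERT encoder with per-task classification heads; the number of heads and label counts are assumptions for illustration, not the submitted system:

```python
# Illustrative multi-task setup: one shared encoder, one linear head per sub-task.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskBert(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels_per_task=(2, 3, 3)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Hypothetical: three heads, e.g. one per offensive-language sub-task.
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in num_labels_per_task)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]           # [CLS] representation
        return [head(pooled) for head in self.heads]   # one logit tensor per task
```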
arXiv Detail & Related papers (2020-04-28T11:27:24Z)