dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features
- URL: http://arxiv.org/abs/2407.13608v1
- Date: Thu, 18 Jul 2024 15:47:42 GMT
- Title: dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features
- Authors: Mohamed Lichouri, Khaled Lounnas, Boualem Nadjib Zahaf, Mehdi Ayoub Rabiai,
- Abstract summary: This paper presents the contribution of our dzNLP team to the NADI 2024 shared task.
Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision.
While our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the contribution of our dzNLP team to the NADI 2024 shared task, specifically in Subtask 1 - Multi-label Country-level Dialect Identification (MLDID) (Closed Track). We explored various configurations to address the challenge: in Experiment 1, we utilized a union of n-gram analyzers (word, character, character with word boundaries) with different n-gram values; in Experiment 2, we combined a weighted union of Term Frequency-Inverse Document Frequency (TF-IDF) features with various weights; and in Experiment 3, we implemented a weighted majority voting scheme using three classifiers: Linear Support Vector Classifier (LSVC), Random Forest (RF), and K-Nearest Neighbors (KNN). Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of F1-score and precision. Notably, we achieved the highest precision score of 63.22% among the participating teams. However, our overall F1-score was approximately 21%, significantly impacted by a low recall rate of 12.87%. This indicates that while our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement in handling diverse dialectal variations.
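Below is a minimal sketch of how such a configuration could be assembled with scikit-learn: a weighted union of word, character, and character-with-word-boundary TF-IDF analyzers (as in Experiments 1 and 2) feeding a weighted hard-voting ensemble of LSVC, RF, and KNN (as in Experiment 3). The n-gram ranges, feature weights, hyperparameters, and voting weights shown here are illustrative assumptions, not the values used by the dzNLP team, and the multi-label aspect of MLDID (e.g. per-country thresholding) is omitted for brevity.

```python
# Minimal sketch, assuming scikit-learn; n-gram ranges, weights, and
# hyperparameters below are illustrative placeholders, not the paper's values.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Experiments 1-2: weighted union of word, character, and
# character-with-word-boundary TF-IDF n-gram features.
features = FeatureUnion(
    transformer_list=[
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
        ("char_wb", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ],
    transformer_weights={"word": 1.0, "char": 0.5, "char_wb": 0.5},  # assumed weights
)

# Experiment 3: weighted majority (hard) voting over LSVC, RF, and KNN.
ensemble = VotingClassifier(
    estimators=[
        ("lsvc", LinearSVC()),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
    weights=[2, 1, 1],  # assumed classifier weights
)

model = Pipeline([("tfidf", features), ("clf", ensemble)])
# Usage: model.fit(train_texts, train_labels); preds = model.predict(test_texts)
```

Hard voting over predicted labels mirrors a weighted majority vote directly; a soft-voting variant would require probability estimates (e.g. a calibrated LSVC), which the abstract does not specify.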
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification [0.0]
We investigate the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features.
During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%.
arXiv Detail & Related papers (2023-12-16T20:23:53Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- Two-stage Pipeline for Multilingual Dialect Detection [0.0]
This paper outlines our approach to the VarDial 2023 shared task.
We identify two or three dialects from each of three languages.
We achieve a score of 58.54% for Track-1 and 85.61% for Track-2.
arXiv Detail & Related papers (2023-03-06T20:35:51Z)
- UniTE: Unified Translation Evaluation [63.58868113074476]
UniTE is the first unified framework capable of handling all three evaluation tasks.
We evaluate our framework on the WMT 2019 Metrics and WMT 2020 Quality Estimation benchmarks.
arXiv Detail & Related papers (2022-04-28T08:35:26Z)
- UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new, efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z)
- Neighborhood Contrastive Learning for Novel Class Discovery [79.14767688903028]
We build a new framework, named Neighborhood Contrastive Learning, to learn discriminative representations that are important to clustering performance.
We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-06-20T17:34:55Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- Multi-Label Sentiment Analysis on 100 Languages with Dynamic Weighting for Label Imbalance [0.0]
Cross-lingual sentiment analysis has attracted significant attention due to its applications in various areas including market research, politics and social sciences.
We introduce a sentiment analysis framework in a multi-label setting that follows Plutchik's wheel of emotions.
We show that our method obtains the state-of-the-art performance in 7 of 9 metrics in 3 different languages.
arXiv Detail & Related papers (2020-08-26T14:16:02Z)
- Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection [34.99472489405047]
This paper proposes a Multi-Task Learning framework for learning from Weakly Labelled Audio data.
We show that the chosen auxiliary task de-noises the internal T-F representation and improves SED performance on noisy recordings.
The proposed framework outperforms existing benchmark models across all SNRs.
arXiv Detail & Related papers (2020-08-17T04:46:25Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)