Two-stage Pipeline for Multilingual Dialect Detection
- URL: http://arxiv.org/abs/2303.03487v2
- Date: Tue, 28 Mar 2023 17:55:52 GMT
- Title: Two-stage Pipeline for Multilingual Dialect Detection
- Authors: Ankit Vaidya and Aditya Kane
- Abstract summary: This paper outlines our approach to the VarDial 2023 shared task.
We identify three dialects (Track-1) or two dialects (Track-2) for each of three languages.
We achieve a score of 58.54% for Track-1 and 85.61% for Track-2.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dialect Identification is a crucial task for localizing various Large
Language Models. This paper outlines our approach to the VarDial 2023 shared
task. Here we identify three dialects (Track-1) or two dialects (Track-2) for
each of three languages, resulting in a 9-way classification for Track-1 and a
6-way classification for Track-2. Our proposed approach consists of a two-stage system
and outperforms other participants' systems and previous works in this domain.
We achieve a score of 58.54% for Track-1 and 85.61% for Track-2. Our codebase
is available publicly (https://github.com/ankit-vaidya19/EACL_VarDial2023).
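The two-stage design can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: it assumes stage 1 predicts the language and stage 2 routes the text to a per-language dialect classifier, and the keyword stand-ins, languages, and dialect labels below are purely hypothetical (the actual system uses trained models on the shared-task languages).

```python
# Hedged sketch of a two-stage dialect-detection pipeline. The stage split
# (language first, then dialect within that language) and all classifiers
# here are illustrative assumptions, not the paper's actual models.

class TwoStagePipeline:
    """Stage 1 predicts the language; stage 2 routes the input to a
    per-language dialect classifier, yielding the final fine-grained label."""

    def __init__(self, language_clf, dialect_clfs):
        self.language_clf = language_clf  # callable: text -> language label
        self.dialect_clfs = dialect_clfs  # dict: language -> (text -> dialect)

    def predict(self, text):
        language = self.language_clf(text)
        return self.dialect_clfs[language](text)


# Toy keyword stand-ins; a real system would use fine-tuned transformers.
def toy_language_clf(text):
    return "de" if any(w in text.lower() for w in ("servus", "hallo")) else "it"

toy_dialect_clfs = {
    "de": lambda t: "de-AT" if "servus" in t.lower() else "de-DE",
    "it": lambda t: "it-CH" if "svizzera" in t.lower() else "it-IT",
}

pipeline = TwoStagePipeline(toy_language_clf, toy_dialect_clfs)
print(pipeline.predict("Servus, wie geht's?"))  # de-AT
```

One practical advantage of this routing structure is that each stage-2 classifier only ever discriminates among the dialects of a single language, which simplifies each sub-problem relative to a flat 9-way (or 6-way) classifier.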
Related papers
- MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness [5.91695168183101]
This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness.
The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages.
Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C.
arXiv Detail & Related papers (2024-03-22T06:47:42Z)
- USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification [0.0]
We investigate the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features.
During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%.
arXiv Detail & Related papers (2023-12-16T20:23:53Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- DeMuX: Data-efficient Multilingual Learning [57.37123046817781]
DEMUX is a framework that prescribes exact data-points to label from vast amounts of unlabelled multilingual data.
Our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations.
arXiv Detail & Related papers (2023-11-10T20:09:08Z)
- MarsEclipse at SemEval-2023 Task 3: Multi-Lingual and Multi-Label Framing Detection with Contrastive Learning [21.616089539381996]
This paper describes our system for SemEval-2023 Task 3 Subtask 2 on Framing Detection.
We used a multi-label contrastive loss for fine-tuning large pre-trained language models in a multi-lingual setting.
Our system was ranked first on the official test set and on the official shared task leaderboard for five of the six languages.
arXiv Detail & Related papers (2023-04-20T18:42:23Z)
- MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages [54.002969723086075]
We evaluate cross-lingual open-retrieval question answering systems in 16 typologically diverse languages.
The best system leveraging iteratively mined diverse negative examples achieves 32.2 F1, outperforming our baseline by 4.5 points.
The second best system uses entity-aware contextualized representations for document retrieval, and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores.
arXiv Detail & Related papers (2022-07-02T06:54:10Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Efficient Dialogue State Tracking by Masked Hierarchical Transformer [0.3441021278275805]
We build a cross-lingual dialog state tracker with a training set in a rich-resource language and a test set in a low-resource language.
We formulate a method for joint learning of slot operation classification task and state tracking task.
arXiv Detail & Related papers (2021-06-28T07:35:49Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z) - Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.