HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation
- URL: http://arxiv.org/abs/2412.10095v2
- Date: Thu, 09 Jan 2025 09:09:32 GMT
- Title: HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation
- Authors: Jaione Bengoetxea, Mikel Zubillaga, Ekhi Azurmendi, Maite Heredia, Julen Etxaniz, Markel Ferro, Jeremy Barnes
- Abstract summary: We present our submission for the NorSID Shared Task, consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification.
For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting to leverage the xSID dataset, which is available in 17 languages.
In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which has obtained the highest scores.
- Abstract: In this paper we present our submission for the NorSID Shared Task as part of the 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification, evaluated on data in different dialects of the Norwegian language. For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting to leverage the xSID dataset, which is available in 17 languages. In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which obtained the highest scores within our experiments. Our final results on the test set show that our models do not drop in performance compared to the development set, likely due to the domain-specificity of the dataset and the similar distribution of both subsets. Finally, we also report an in-depth analysis of the provided datasets and their artifacts, as well as other experiments that were carried out but did not yield the best results. Additionally, we analyze why some methods have been more successful than others, mainly the impact of the combination of languages and the domain-specificity of the training data on the results.
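To make the cross-lingual multitask setup concrete, below is a minimal sketch of joint intent detection and slot filling fine-tuning with a shared encoder, in the spirit of the xSID-based approach described in the abstract. The encoder name (xlm-roberta-base), the label-set sizes, and the summed-loss training objective are illustrative assumptions, not the authors' exact configuration.

```python
from torch import nn
from transformers import AutoModel, AutoTokenizer

class JointIntentSlotModel(nn.Module):
    """Shared multilingual encoder with one sentence-level head and one token-level head."""

    def __init__(self, encoder_name: str, num_intents: int, num_slots: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)  # sentence-level intent label
        self.slot_head = nn.Linear(hidden, num_slots)      # per-token BIO slot labels

    def forward(self, input_ids, attention_mask, intent_labels=None, slot_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state                   # (batch, seq_len, hidden)
        intent_logits = self.intent_head(token_states[:, 0])   # first (<s>) token as utterance representation
        slot_logits = self.slot_head(token_states)
        loss = None
        if intent_labels is not None and slot_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)
            # Sum both task losses so a single encoder is optimised for both tasks.
            loss = ce(intent_logits, intent_labels) + ce(
                slot_logits.view(-1, slot_logits.size(-1)), slot_labels.view(-1)
            )
        return loss, intent_logits, slot_logits

# Illustrative forward pass; the label-set sizes are placeholders, not the actual xSID counts.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = JointIntentSlotModel("xlm-roberta-base", num_intents=18, num_slots=41)
batch = tokenizer(["set an alarm for 7 am"], return_tensors="pt")
_, intent_logits, slot_logits = model(**batch)
print(intent_logits.shape, slot_logits.shape)  # (1, 18) and (1, seq_len, 41)
```

In a cross-lingual setting, a model like this would be trained on the concatenation of the available xSID languages and then evaluated on the Norwegian dialect data; the summed loss keeps a single encoder shared between the sentence-level and token-level tasks.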
Related papers
- Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection [22.89563355840371]
Slot and intent detection is a classic natural language understanding task.
Many approaches for low-resource scenarios have not yet been applied to dialectal SID data.
We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties.
arXiv Detail & Related papers (2025-01-07T15:36:35Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Data-Augmentation-Based Dialectal Adaptation for LLMs [26.72394783468532]
This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024.
The task focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects.
We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance.
arXiv Detail & Related papers (2024-04-11T19:15:32Z)
- A deep Natural Language Inference predictor without language-specific training data [44.26507854087991]
We present an NLP technique to tackle natural language inference (NLI) between pairs of sentences in a target language of choice without a language-specific training dataset.
We exploit a generic translation dataset, manually translated, along with two instances of the same pre-trained model.
The model has been evaluated on the machine-translated Stanford NLI test set, the machine-translated Multi-Genre NLI test set, and the manually translated RTE3-ITA test set.
arXiv Detail & Related papers (2023-09-06T10:20:59Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection [4.111899441919165]
We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression.
Our model relies on pre-trained contextual representations from different multi-lingual state-of-the-art transformer-based language models.
arXiv Detail & Related papers (2022-06-07T05:52:43Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.