UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages
- URL: http://arxiv.org/abs/2304.14189v1
- Date: Thu, 27 Apr 2023 13:51:18 GMT
- Title: UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages
- Authors: Egil Rønningstad
- Abstract summary: We show how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining.
The languages are related, to varying degrees, to languages used during pretraining, and the language data contain varying amounts of code-switching.
We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets, which contain samples in the thousands, monolingual fine-tuning yields the best results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our contribution to the 2023 AfriSenti-SemEval Shared Task 12,
Sentiment Analysis for African Languages, provides insight into how a
multilingual large language model can be a resource for sentiment analysis in
languages not seen during pretraining. The shared task provides datasets for a
variety of African languages from different language families. The languages
are related, to varying degrees, to languages used during pretraining, and the
language data contain varying amounts of code-switching. We experiment with
both monolingual and multilingual datasets for the final fine-tuning, and find
that with the provided datasets, which contain samples in the thousands,
monolingual fine-tuning yields the best results.
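As a rough illustration of the monolingual fine-tuning setup described above, the sketch below fine-tunes a multilingual encoder on a single AfriSenti language with Hugging Face Transformers. The checkpoint name, file name, label set, and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: monolingual fine-tuning of a multilingual encoder for
# three-way sentiment classification. All names below are placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

MODEL = "xlm-roberta-base"  # assumed checkpoint; the paper only says "multilingual LLM"
LABELS = ["negative", "neutral", "positive"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

# Hypothetical monolingual split: a TSV with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "ha_train.tsv"}, delimiter="\t")

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["label"] = [label2id[label] for label in batch["label"]]
    return enc

train = dataset["train"].map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

args = TrainingArguments(
    output_dir="uio-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
# Passing the tokenizer enables dynamic padding of variable-length batches.
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```

Swapping the single-language TSV for a concatenation of all task languages turns the same script into the multilingual fine-tuning variant the paper compares against.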
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets; a sketch of this lexicon-only pretraining idea follows below.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
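One plausible reading of the lexicon-only pretraining in the paper above is to treat each lexicon entry, a single word with a polarity, as a tiny labeled example. The sketch below follows that reading; the lexicon contents, model choice, and hyperparameters are invented for illustration, not taken from the paper.

```python
# Hypothetical lexicon-only pretraining: train a classifier on
# (word, polarity) pairs, with no sentence-level sentiment data at all.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "xlm-roberta-base"  # placeholder multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Assumption: a multilingual sentiment lexicon mapping words to 0/1 polarity.
lexicon = {"good": 1, "bad": 0, "bon": 1, "mauvais": 0}

words, labels = list(lexicon.keys()), list(lexicon.values())
batch = tokenizer(words, return_tensors="pt", padding=True)
batch["labels"] = torch.tensor(labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the (tiny, toy) lexicon
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```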
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role in language grouping; a toy sketch of the gradient-similarity idea follows below.
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
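A minimal sketch in the spirit of GradSim: compute one loss gradient per language and compare gradient directions with cosine similarity, so that languages whose gradients point the same way become candidates for joint training. The toy linear model and random batches below stand in for a real encoder and real per-language data.

```python
# Toy gradient-similarity computation for language grouping.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(768, 3)  # stand-in for a real classifier head

def language_gradient(features, labels):
    """Return the flattened loss gradient for one language's batch."""
    model.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Assumption: per-language mini-batches of encoded features and labels.
batches = {
    "hausa": (torch.randn(8, 768), torch.randint(0, 3, (8,))),
    "yoruba": (torch.randn(8, 768), torch.randint(0, 3, (8,))),
    "swahili": (torch.randn(8, 768), torch.randint(0, 3, (8,))),
}
grads = {lang: language_gradient(x, y) for lang, (x, y) in batches.items()}

langs = list(grads)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        sim = F.cosine_similarity(grads[a], grads[b], dim=0).item()
        print(f"{a} vs {b}: gradient cosine similarity = {sim:.3f}")
```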
- SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects [9.501383449039142]
We created SIB-200 -- a large-scale benchmark dataset for topic classification in 200 languages and dialects.
For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for Natural Language Understanding.
We found that languages unseen during the pre-training of multilingual language models, under-represented language families, and languages from Africa, the Americas, Oceania, and Southeast Asia often have the lowest performance on our topic classification dataset.
arXiv Detail & Related papers (2023-09-14T05:56:49Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis [11.05909046179595]
This paper describes our system developed for SemEval-2023 Task 12, "Sentiment Analysis for Low-resource African languages using Twitter dataset".
Our key finding: adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance remarkably, by more than 10 F1 points; a sketch of this adaptation step follows below.
In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
arXiv Detail & Related papers (2023-04-28T21:02:58Z)
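The "adapting the pretrained model to the target language" step above is typically implemented as continued masked-language-model training on a small in-language corpus before task fine-tuning. The sketch below assumes that standard recipe; the file name, corpus format, and hyperparameters are placeholders, not the NLNDE system's exact setup.

```python
# Sketch of language-adaptive pretraining: continue MLM training on a
# small unlabeled target-language corpus, then reuse the adapted encoder.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

MODEL = "xlm-roberta-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

# Assumption: one unlabeled target-language text file, one tweet per line.
corpus = load_dataset("text", data_files={"train": "target_lang_tweets.txt"})
tokenized = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator randomly masks 15% of tokens to create the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="adapted-mlm", num_train_epochs=1)
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
# The adapted encoder weights can then initialize the sentiment classifier.
```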
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language [32.170748231414365]
We show that training on even just a single related language gives the largest gain.
We also find that adding data from unrelated languages generally doesn't hurt performance.
arXiv Detail & Related papers (2021-06-24T08:37:05Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts; a quick vocabulary-coverage check along these lines is sketched below.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
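One rough proxy for how well a language is represented in a multilingual model's vocabulary is subword fertility: the average number of subwords the tokenizer produces per whitespace-separated word, where higher fertility suggests weaker coverage. The sketch below is illustrative only; the sample sentences are not from the paper, and fertility is one heuristic among several the paper considers.

```python
# Quick subword-fertility check as a vocabulary-coverage heuristic.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder sentences; in practice use held-out text per language.
samples = {
    "english": "the weather is lovely today",
    "finnish": "sää on tänään ihana",
}

for lang, sentence in samples.items():
    words = sentence.split()
    subwords = tokenizer.tokenize(sentence)
    # Fertility near 1 means most words map to a single vocabulary entry.
    print(f"{lang}: {len(subwords) / len(words):.2f} subwords per word")
```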
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It features over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)