NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language
Selection for Low-Resource Multilingual Sentiment Analysis
- URL: http://arxiv.org/abs/2305.00090v1
- Date: Fri, 28 Apr 2023 21:02:58 GMT
- Title: NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language
Selection for Low-Resource Multilingual Sentiment Analysis
- Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen and Hinrich
Schütze
- Abstract summary: This paper describes our system developed for the SemEval-2023 Task 12 "Sentiment Analysis for Low-resource African Languages using Twitter Dataset".
A key finding is that adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance by more than 10 F1 points.
In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
- Score: 11.05909046179595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes our system developed for the SemEval-2023 Task 12
"Sentiment Analysis for Low-resource African Languages using Twitter Dataset".
Sentiment analysis is one of the most widely studied applications in natural
language processing. However, most prior work still focuses on a small number
of high-resource languages. Building reliable sentiment analysis systems for
low-resource languages remains challenging, due to the limited training data in
this task. In this work, we propose to leverage language-adaptive and
task-adaptive pretraining on African texts and study transfer learning with
source language selection on top of an African language-centric pretrained
language model. Our key findings are: (1) Adapting the pretrained model to the
target language and task using a small yet relevant corpus improves performance
remarkably by more than 10 F1 score points. (2) Selecting source languages with
positive transfer gains during training can avoid harmful interference from
dissimilar languages, leading to better results in multilingual and
cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks
and, in particular, performs best in the multilingual evaluation.
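The source language selection idea in finding (2) can be illustrated with a minimal sketch. It assumes that a development-set F1 score is available for a target-only baseline and for each run that adds one candidate source language; the function name `select_source_languages`, the `min_gain` threshold, the language labels, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_source_languages(
    baseline_f1: float,
    f1_with_source: Dict[str, float],
    min_gain: float = 0.0,
) -> List[str]:
    """Return source languages whose addition beat the target-only baseline.

    baseline_f1: dev F1 when training on the target language only.
    f1_with_source: dev F1 when one extra source language is added.
    min_gain: a language is kept only if its gain is strictly above this.
    """
    # Transfer gain of each candidate relative to the baseline.
    gains = {lang: score - baseline_f1 for lang, score in f1_with_source.items()}
    # Keep only candidates with a positive gain; drop neutral or harmful ones.
    selected = [lang for lang, gain in gains.items() if gain > min_gain]
    # Order by gain, largest first.
    return sorted(selected, key=lambda lang: gains[lang], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    scores = {"lang_a": 71.5, "lang_b": 69.2, "lang_c": 70.8, "lang_d": 70.0}
    print(select_source_languages(70.0, scores))
```

Candidates that leave the score unchanged or lower it are excluded, which mirrors the goal of avoiding interference from dissimilar languages; how the gains are actually measured in the paper is not specified in this abstract.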
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, where we employed different multilingual XLM-R models with a classification head trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages [0.0]
The task aims to perform monolingual sentiment classification (sub-task A) for 12 African languages, multilingual sentiment classification (sub-task B) and zero-shot sentiment classification (sub-task C).
Our findings suggest that using pre-trained Afro-centric language models improves performance for low-resource African languages.
We also ran experiments using adapters for zero-shot tasks, and the results suggest that we can obtain promising results by using adapters with a limited amount of resources.
arXiv Detail & Related papers (2023-04-13T12:54:29Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data but also is highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation [0.0]
This study explores the potential benefits of transfer learning in an English-isiZulu translation framework.
We gathered results from 8 different language corpora, including one multi-lingual corpus, and saw that isiXhosa-isiZulu outperformed all languages.
We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC) which provides an easy selection of languages for the pre-trained models.
arXiv Detail & Related papers (2022-05-17T20:41:25Z)
- Por Qué Não Utiliser Alla Språk? Mixed Training with Gradient Optimization in Few-Shot Cross-Lingual Transfer [2.7213511121305465]
We propose a one-step mixed training method that trains on both source and target data.
We use one model to handle all target languages simultaneously to avoid excessively language-specific models.
Our proposed method achieves state-of-the-art performance on all tasks and outperforms target-adapting by a large margin.
arXiv Detail & Related papers (2022-04-29T04:05:02Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)