Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language
Detection
- URL: http://arxiv.org/abs/2206.00962v1
- Date: Thu, 2 Jun 2022 09:53:15 GMT
- Title: Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language
Detection
- Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Masaki Arata, Gniewosz
Leliwa, Michal Wroczynski
- Abstract summary: Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection.
Our datasets cover seven languages from three language families.
- Score: 2.2998722397348335
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study the selection of transfer languages for automatic abusive language
detection. Instead of preparing a dataset for every language, we demonstrate
the effectiveness of cross-lingual transfer learning for zero-shot abusive
language detection. This way we can use existing data from higher-resource
languages to build better detection systems for low-resource languages. Our
datasets cover seven languages from three language families. We
measure the distance between the languages using several language similarity
measures, in particular by quantifying features from the World Atlas of
Language Structures (WALS). We
show that there is a correlation between linguistic similarity and classifier
performance. This discovery allows us to choose an optimal transfer language
for zero-shot abusive language detection.
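The abstract's core idea — quantifying WALS-based similarity and checking its correlation with zero-shot classifier performance — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the binary feature vectors and F1 scores below are invented for demonstration, and real WALS features are categorical rather than binary.

```python
from math import sqrt

# Hypothetical binarized WALS-style feature vectors (illustrative only;
# actual WALS features are categorical and far more numerous).
WALS = {
    "english":  [1, 0, 1, 1, 0, 1],
    "german":   [1, 0, 1, 0, 0, 1],
    "finnish":  [0, 1, 0, 0, 1, 0],
    "estonian": [0, 1, 0, 1, 1, 0],
}

def wals_distance(lang_a: str, lang_b: str) -> float:
    """Normalized Hamming distance between two feature vectors."""
    fa, fb = WALS[lang_a], WALS[lang_b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical zero-shot F1 scores when transferring from each source
# language to Finnish; the numbers are made up for illustration.
target = "finnish"
sources = ["english", "german", "estonian"]
f1 = {"english": 0.61, "german": 0.63, "estonian": 0.78}

distances = [wals_distance(s, target) for s in sources]
scores = [f1[s] for s in sources]
print(pearson(distances, scores))  # negative: closer languages score higher
```

A strongly negative correlation here mirrors the paper's claim: the typologically closer the transfer language, the better the zero-shot performance, so the nearest source language is a reasonable default choice.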
Related papers
- Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune? [0.0]
This study proposes a method to select languages for instruction tuning in a linguistically informed way.
We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions.
Our results show that this careful selection generally leads to better outcomes than choosing languages at random.
arXiv Detail & Related papers (2024-10-10T10:57:24Z) - CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z) - Measuring Cross-lingual Transfer in Bytes [9.011910726620538]
We show that models pretrained on diverse source languages perform similarly on a target language in a cross-lingual setting.
We also found evidence that this transfer is not related to language contamination or language proximity.
Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
arXiv Detail & Related papers (2024-04-12T01:44:46Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Zero-shot cross-lingual transfer language selection using linguistic
similarity [3.029434408969759]
We study the selection of transfer languages for different Natural Language Processing tasks.
For the study, we used datasets from eight different languages from three language families.
arXiv Detail & Related papers (2023-01-31T15:56:40Z) - Multilingual transfer of acoustic word embeddings improves when training
on languages related to the target zero-resource language [32.170748231414365]
We show that training on even just a single related language gives the largest gain.
We also find that adding data from unrelated languages generally doesn't hurt performance.
arXiv Detail & Related papers (2021-06-24T08:37:05Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.