Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive
Language Detection
- URL: http://arxiv.org/abs/2311.02025v1
- Date: Fri, 3 Nov 2023 16:51:07 GMT
- Title: Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive
Language Detection
- Authors: Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto
- Abstract summary: Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results.
We resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection.
- Score: 19.399281609371258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual transfer learning from high-resource to medium and low-resource
languages has shown encouraging results. However, the scarcity of resources in
target languages remains a challenge. In this work, we resort to data
augmentation and continual pre-training for domain adaptation to improve
cross-lingual abusive language detection. For data augmentation, we analyze two
existing techniques based on vicinal risk minimization and propose MIXAG, a
novel data augmentation method which interpolates pairs of instances based on
the angle of their representations. Our experiments involve seven languages
typologically distinct from English and three different domains. The results
reveal that the data augmentation strategies can enhance few-shot cross-lingual
abusive language detection. Specifically, we observe that MIXAG yields consistent
and significant improvements across all target languages in multidomain and
multilingual environments. Finally, we show through an error analysis how domain
adaptation can favour the class of abusive texts (reducing false negatives) while,
at the same time, lowering the precision of the abusive language detection model.
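As a concrete illustration of the two augmentation ideas above, here is a minimal sketch, assuming standard mixup as the vicinal risk minimization baseline and a spherical, angle-weighted interpolation as an illustrative stand-in for MIXAG. The abstract does not give the actual MIXAG formula; the function names, the Beta(alpha, alpha) mixing coefficient, and the 768-dimensional toy representations are assumptions rather than details from the paper.

```python
# Minimal sketch (not the authors' code): mixup as a vicinal risk minimization
# baseline, plus a hypothetical angle-based interpolation in the spirit of MIXAG.
# The spherical interpolation below is an illustrative assumption.
import numpy as np

def mixup(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Standard mixup (vicinal risk minimization): linearly interpolate a pair
    of instances and their labels with a Beta(alpha, alpha) coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def angle_interpolate(x1, x2, y1, y2, lam=0.5):
    """Hypothetical angle-based interpolation: spherically interpolate between
    the two representations, putting weight `lam` on the first instance
    (lam=1 returns x1, lam=0 returns x2) and mixing labels with the same weight."""
    u1, u2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)
    theta = np.arccos(np.clip(u1 @ u2, -1.0, 1.0))   # angle between the pair
    if np.isclose(theta, 0.0):                        # (near-)parallel: plain linear mix
        x = lam * x1 + (1.0 - lam) * x2
    else:
        x = (np.sin(lam * theta) * x1 + np.sin((1.0 - lam) * theta) * x2) / np.sin(theta)
    return x, lam * y1 + (1.0 - lam) * y2

# Toy usage on two sentence representations with soft labels (non-abusive vs abusive).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=768), rng.normal(size=768)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(mixup(x1, x2, y1, y2, rng=rng)[1])
print(angle_interpolate(x1, x2, y1, y2, lam=0.3)[1])
```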
Related papers
- Zero-shot Cross-lingual Stance Detection via Adversarial Language Adaptation [7.242609314791262]
This paper introduces Multilingual Translation-Augmented BERT (MTAB), a novel approach to zero-shot cross-lingual stance detection.
Our technique employs translation augmentation to improve zero-shot performance and pairs it with adversarial learning to further boost model efficacy (a minimal adversarial-adaptation sketch appears after this list).
We demonstrate the effectiveness of our proposed approach, showcasing improved results in comparison to a strong baseline model as well as ablated versions of our model.
arXiv Detail & Related papers (2024-04-22T16:56:43Z)
- From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models [10.807067327137855]
As language models embrace multilingual capabilities, it is crucial that our safety measures keep pace.
In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques.
This allows us to examine the effects of translation quality and of cross-lingual transfer on toxicity mitigation.
arXiv Detail & Related papers (2024-03-06T17:51:43Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset [6.7839993945546215]
We introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families.
We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models.
We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts.
arXiv Detail & Related papers (2023-05-08T09:48:21Z)
- A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning [6.329304732560936]
Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries.
We propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss.
arXiv Detail & Related papers (2022-10-18T15:36:53Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. A minimal sketch of (ii) appears after this list.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
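For the MTAB entry above ("Zero-shot Cross-lingual Stance Detection via Adversarial Language Adaptation"), the adversarial component is only described at a high level. The sketch below assumes the common gradient-reversal formulation of adversarial language adaptation: a language discriminator is trained on encoder outputs while reversed gradients push the encoder toward language-invariant representations. The class names, dimensions, and the gradient-reversal choice itself are assumptions, not details from that paper.

```python
# Minimal sketch of adversarial language adaptation via gradient reversal.
# Assumption: the adversarial learning is implemented in this common style;
# the MTAB paper's actual architecture may differ.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LanguageAdversary(nn.Module):
    """Predicts the language of a sentence representation; the reversed gradient
    encourages the upstream encoder to produce language-invariant features."""
    def __init__(self, hidden_dim=768, num_languages=2, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, num_languages)
        )

    def forward(self, sentence_repr):
        reversed_repr = GradReverse.apply(sentence_repr, self.lambd)
        return self.classifier(reversed_repr)

# Toy usage: total loss = task loss + adversarial language-identification loss.
adversary = LanguageAdversary()
repr_batch = torch.randn(8, 768, requires_grad=True)  # stand-in for encoder outputs
lang_labels = torch.randint(0, 2, (8,))
adv_loss = nn.functional.cross_entropy(adversary(repr_batch), lang_labels)
adv_loss.backward()  # gradients reaching the encoder are reversed
```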
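For item (ii) of "Inducing Language-Agnostic Multilingual Representations", removing language-specific means and variances amounts to standardizing each language's embeddings with statistics computed on that language alone. The sketch below is a minimal illustration under that reading; the function name and toy data are invented for the example.

```python
# Minimal sketch: remove language-specific means and variances by standardizing
# each language's embeddings with statistics computed on that language only.
import numpy as np

def remove_language_stats(embeddings, languages, eps=1e-8):
    """embeddings: (n, d) array; languages: length-n sequence of language codes.
    Returns embeddings standardized per language (zero mean, unit variance)."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    languages = np.asarray(languages)
    out = np.empty_like(embeddings)
    for lang in np.unique(languages):
        mask = languages == lang
        mu = embeddings[mask].mean(axis=0)          # language-specific mean
        sigma = embeddings[mask].std(axis=0) + eps  # language-specific std
        out[mask] = (embeddings[mask] - mu) / sigma
    return out

# Toy usage with two "languages" whose embeddings have different offsets and scales.
rng = np.random.default_rng(0)
en = rng.normal(loc=1.0, scale=2.0, size=(5, 4))
tr = rng.normal(loc=-3.0, scale=0.5, size=(5, 4))
X = np.vstack([en, tr])
langs = ["en"] * 5 + ["tr"] * 5
X_agnostic = remove_language_stats(X, langs)
print(X_agnostic[:5].mean(axis=0).round(6), X_agnostic[5:].mean(axis=0).round(6))
```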
This list is automatically generated from the titles and abstracts of the papers on this site.