A simple language-agnostic yet very strong baseline system for hate
speech and offensive content identification
- URL: http://arxiv.org/abs/2202.02511v1
- Date: Sat, 5 Feb 2022 08:09:09 GMT
- Title: A simple language-agnostic yet very strong baseline system for hate
speech and offensive content identification
- Authors: Yves Bestgen
- Abstract summary: A system based on a classical supervised algorithm only fed with character n-grams, and thus completely language-agnostic, is proposed.
It reached a medium performance level in English, the language for which it is easy to develop deep learning approaches.
It ends even first when performances are averaged over the three tasks in these languages, outperforming many deep learning approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For automatically identifying hate speech and offensive content in tweets, a
system based on a classical supervised algorithm only fed with character
n-grams, and thus completely language-agnostic, is proposed by the SATLab team.
After its optimization in terms of the feature weighting and the classifier
parameters, it reached, in the multilingual HASOC 2021 challenge, a medium
performance level in English, the language for which it is easy to develop deep
learning approaches relying on many external linguistic resources, but a far
better level for the two less resourced language, Hindi and Marathi. It ends
even first when performances are averaged over the three tasks in these
languages, outperforming many deep learning approaches. These performances
suggest that it is an interesting reference level to evaluate the benefits of
using more complex approaches such as deep learning or taking into account
complementary resources.
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs)
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z) - No Train but Gain: Language Arithmetic for training-free Language Adapters enhancement [59.37775534633868]
We introduce a novel method called language arithmetic, which enables training-free post-processing.
The effectiveness of the proposed solution is demonstrated on three downstream tasks in a MAD-X-based set of cross-lingual schemes.
arXiv Detail & Related papers (2024-04-24T08:52:40Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z) - Cross-lingual hate speech detection based on multilingual
domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine if knowledge from one particular language can be used to classify other language.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
arXiv Detail & Related papers (2021-04-30T02:24:50Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Gamified Crowdsourcing for Idiom Corpora Construction [0.0]
This article introduces a gamified crowdsourcing approach for collecting language learning materials for idiomatic expressions.
A messaging bot is designed as an asynchronous multiplayer game for native speakers who compete with each other.
The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages.
arXiv Detail & Related papers (2021-02-01T14:44:43Z) - Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual
Speech Recognition [58.849768879796905]
We propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition.
The A2 framework overcomes the long-tail problem via three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and (3) overcoming the class imbalance, either by imposing class priors in the loss during training or adjusting the logits of the softmax output during inference.
arXiv Detail & Related papers (2020-12-03T03:46:16Z) - LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [2.7249643773851724]
"Large Scale Colloquial Persian dataset" aims to describe the colloquial language of low-resourced languages.
The proposed corpus consists of 120M sentences resulted from 27M tweets annotated with parsing tree, part-of-speech tags, sentiment polarity and translation in five different languages.
arXiv Detail & Related papers (2020-03-13T22:24:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.