MUDES: Multilingual Detection of Offensive Spans
- URL: http://arxiv.org/abs/2102.09665v1
- Date: Thu, 18 Feb 2021 23:19:00 GMT
- Title: MUDES: Multilingual Detection of Offensive Spans
- Authors: Tharindu Ranasinghe, Marcos Zampieri
- Abstract summary: MUDES is a system to detect offensive spans in texts.
It features pre-trained models, a Python API for developers, and a user-friendly web-based interface.
- Score: 3.284443134471233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The interest in offensive content identification in social media has grown
substantially in recent years. Previous work has dealt mostly with post-level
annotations. However, identifying offensive spans is useful in many ways. To
help cope with this important challenge, we present MUDES, a multilingual
system to detect offensive spans in texts. MUDES features pre-trained models, a
Python API for developers, and a user-friendly web-based interface. A detailed
description of MUDES' components is presented in this paper.
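The abstract mentions a Python API for detecting offensive spans. As a minimal, self-contained sketch of what span detection amounts to under the hood (this is not the MUDES API; all function and label names below are illustrative assumptions), a token-classification model labels each token, and those token-level labels are then mapped back to character offsets in the original text:

```python
# Illustrative sketch: converting token-level toxicity labels into
# character-level offensive spans. Names ("TOXIC", labels_to_spans)
# are assumptions for this example, not the MUDES API.
from typing import List, Tuple


def labels_to_spans(text: str,
                    tokens: List[Tuple[str, int]],  # (token, start offset in text)
                    labels: List[str]) -> List[int]:
    """Collect the character offsets covered by tokens labeled 'TOXIC'."""
    offsets = []
    for (token, start), label in zip(tokens, labels):
        if label == "TOXIC":
            offsets.extend(range(start, start + len(token)))
    return offsets


text = "You are a total idiot today"
tokens = [("You", 0), ("are", 4), ("a", 8), ("total", 10),
          ("idiot", 16), ("today", 22)]
labels = ["O", "O", "O", "O", "TOXIC", "O"]
print(labels_to_spans(text, tokens, labels))  # → [16, 17, 18, 19, 20]
```

Character-offset lists like this are the output format used in toxic-spans evaluation, which is why a span-level system must track token offsets rather than only token indices.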
Related papers
- OffensiveLang: A Community Based Implicit Offensive Language Dataset [5.813922783967869]
Hate speech and offensive language exist in both explicit and implicit forms.
OffensiveLang is a community-based implicit offensive language dataset.
We present a prompt-based approach that effectively generates implicit offensive language.
arXiv Detail & Related papers (2024-03-04T20:34:58Z)
- Muted: Multilingual Targeted Offensive Speech Identification and Visualization [15.656203119337436]
Muted is a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity.
We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text.
arXiv Detail & Related papers (2023-12-18T16:50:27Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss [1.7267596343997798]
The complexity of identifying offensive content is exacerbated by the use of multiple modalities.
Our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual, and code-mixed setting.
arXiv Detail & Related papers (2021-11-12T19:50:24Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification [34.57343857418401]
Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification.
In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner.
We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models.
arXiv Detail & Related papers (2020-04-29T20:02:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.