TuPy-E: detecting hate speech in Brazilian Portuguese social media with
a novel dataset and comprehensive analysis of models
- URL: http://arxiv.org/abs/2312.17704v1
- Date: Fri, 29 Dec 2023 17:47:00 GMT
- Authors: Felipe Oliveira, Victoria Reis, Nelson Ebecken
- Abstract summary: TuPy-E is the largest annotated Portuguese corpus for hate speech detection.
We conduct a detailed analysis using advanced techniques like BERT models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social media has become integral to human interaction, providing a platform
for communication and expression. However, the rise of hate speech on these
platforms poses significant risks to individuals and communities. Detecting and
addressing hate speech is particularly challenging in languages like Portuguese
due to its rich vocabulary, complex grammar, and regional variations. To
address this, we introduce TuPy-E, the largest annotated Portuguese corpus for
hate speech detection. TuPy-E leverages an open-source approach, fostering
collaboration within the research community. We conduct a detailed analysis
using advanced techniques like BERT models, contributing to both academic
understanding and practical applications.
Related papers
- Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish [0.08192907805418582]
Hate speech detection deals with many language variants, slang, slurs, expression modalities, and cultural nuances.
This work presents a brief analysis of the performance of large language models in the detection of Hate Speech for Rioplatense Spanish.
(2024-10-16T02:32:12Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
(2024-05-28T04:11:37Z)
- KamerRaad: Enhancing Information Retrieval in Belgian National Politics through Hierarchical Summarization and Conversational Interfaces [55.00702535694059]
KamerRaad is an AI tool that leverages large language models to help citizens interactively engage with Belgian political information.
The tool extracts and concisely summarizes key excerpts from parliamentary proceedings, followed by the potential for interaction based on generative AI.
(2024-04-22T15:01:39Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
(2023-05-24T07:42:15Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
(2022-10-04T10:34:25Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
(2022-04-07T14:28:51Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
(2022-01-15T20:48:14Z)
- Cross-lingual Capsule Network for Hate Speech Detection in Social Media [6.531659195805749]
We investigate the cross-lingual hate speech detection task, tackling the problem by adapting the hate speech resources from one language to another.
We propose a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for hate speech.
Our model achieves state-of-the-art performance on benchmark datasets from AMI@Evalita 2018 and AMI@Ibereval 2018.
(2021-08-06T12:53:41Z)
- Cross-lingual hate speech detection based on multilingual domain-specific word embeddings [4.769747792846004]
We propose to address the problem of multilingual hate speech detection from the perspective of transfer learning.
Our goal is to determine whether knowledge from one particular language can be used to classify text in another language.
We show that the use of our simple yet specific multilingual hate representations improves classification results.
(2021-04-30T02:24:50Z)
- Contextual Lexicon-Based Approach for Hate Speech and Offensive Language Detection [1.1744028458220426]
This paper presents a new approach for offensive language and hate speech detection on social media.
Our approach incorporates an offensive lexicon composed of implicit and explicit offensive and swearing expressions annotated with binary classes.
Due to the severity of the hate speech and offensive comments in Brazil and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate our method.
(2021-04-25T21:34:51Z)
- DeepHate: Hate Speech Detection via Multi-Faceted Text Representations [8.192671048046687]
DeepHate is a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information.
We conduct extensive experiments and evaluate DeepHate on three large publicly available real-world datasets.
(2021-03-14T16:11:30Z)