The performance of multiple language models in identifying offensive
language on social media
- URL: http://arxiv.org/abs/2312.11504v1
- Date: Sun, 10 Dec 2023 18:58:26 GMT
- Title: The performance of multiple language models in identifying offensive
language on social media
- Authors: Hao Li, Brandon Bennett
- Abstract summary: The aim of this research is to use a variety of algorithms to test the ability to identify offensive posts.
The motivation for this project is to reduce the harm of these languages to human censors by automating the screening of offending posts.
- Score: 6.221851249300585
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text classification is an important topic in the field of natural language
processing. It has been preliminarily applied in information retrieval, digital
library, automatic abstracting, text filtering, word semantic discrimination
and many other fields. The aim of this research is to use a variety of
algorithms to test the ability to identify offensive posts and evaluate their
performance against a variety of assessment methods. The motivation for this
project is to reduce the harm of these languages to human censors by automating
the screening of offending posts. The field is a new one, and despite much
interest in the past two years, there has been no focus on the object of the
offence. Through the experiment of this project, it should inspire future
research on identification methods as well as identification content.
Related papers
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - Muted: Multilingual Targeted Offensive Speech Identification and
Visualization [15.656203119337436]
Muted is a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity.
We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text.
arXiv Detail & Related papers (2023-12-18T16:50:27Z) - When a Language Question Is at Stake. A Revisited Approach to Label
Sensitive Content [0.0]
Article revisits an approach of pseudo-labeling sensitive data on the example of Ukrainian tweets covering the Russian-Ukrainian war.
We provide a fundamental statistical analysis of the obtained data, evaluation of models used for pseudo-labelling, and set further guidelines on how the scientists can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z) - Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z) - Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual
Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detection of online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B- parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu and Urdu)
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Leveraging Multilingual Transformers for Hate Speech Detection [11.306581296760864]
We leverage state of the art Transformer language models to identify hate speech in a multilingual setting.
With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages.
arXiv Detail & Related papers (2021-01-08T20:23:50Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z) - Leveraging Foreign Language Labeled Data for Aspect-Based Opinion Mining [1.503974529275767]
We present a supervised aspect-based opinion mining method that utilizes labeled data from a foreign language.
Because aspects and opinions in different languages may be expressed by different words, we propose using word embeddings.
We also introduce an annotated corpus of aspect and sentiment polarities extracted from restaurant reviews in Vietnamese.
arXiv Detail & Related papers (2020-03-15T15:53:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.