Neural Models for Offensive Language Detection
- URL: http://arxiv.org/abs/2106.14609v1
- Date: Sun, 30 May 2021 13:02:45 GMT
- Title: Neural Models for Offensive Language Detection
- Authors: Ehab Hamdy
- Abstract summary: Offensive language detection is an ever-growing natural language processing (NLP) application.
We believe that building and comparing machine learning models to fight such harmful content is an important and challenging goal for this thesis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offensive language detection is an ever-growing natural language processing (NLP) application. This growth is mainly due to the widespread use of social networks, which have become a mainstream channel for people to communicate, work, and enjoy entertainment content. Many incidents of sharing aggressive and offensive content have negatively impacted society to a great extent. We believe that building and comparing machine learning models to fight such harmful content is an important and challenging goal for this thesis. We therefore target the problem of offensive language detection by building efficient automated detection models. Recent advances in NLP, in particular the Transformer architecture, have addressed many shortcomings of standard sequence-to-sequence techniques, and the BERT model has achieved state-of-the-art results on many NLP tasks, although the literature is still exploring the reasons behind BERT's success. More efficient variants, such as RoBERTa and ALBERT, have since been developed to improve upon the standard BERT. Moreover, because the multilingual nature of social media text can affect a model's decision on a given tweet, it is becoming essential to examine multilingual models such as XLM-RoBERTa, trained on 100 languages, and to compare them with monolingual models. The RoBERTa-based model proved to be the most capable and achieved the highest F1 score on the tasks. Another critical aspect of a well-rounded offensive language detection system is the speed at which a model can be trained and used for inference. In that respect, we considered model run-time and fine-tuned BlazingText, a highly efficient implementation of FastText, which achieved good results while being much faster than the BERT-based models.
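The fine-tuning setup described in the abstract can be illustrated with a short sketch. The snippet below is not the thesis code; it shows the general recipe of fine-tuning a RoBERTa sequence classifier for offensive vs. not-offensive tweets with the Hugging Face Transformers library, where the file names, column names, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the thesis code): fine-tune RoBERTa as a binary
# offensive-language classifier with Hugging Face Transformers.
# File names ("train.csv", "dev.csv"), column names ("text", "label"), and
# hyperparameters are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Assumed CSV files with a "text" column and a "label" column (0 = NOT, 1 = OFF).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro F1, the metric highlighted in the abstract.
    return {"f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-offensive",
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=2e-5,
        evaluation_strategy="epoch",
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```

For the run-time comparison, the speed-oriented baseline can be sketched with the open-source fastText library (BlazingText is Amazon SageMaker's accelerated implementation of the same algorithm); again, the file name and hyperparameters are assumptions, not values from the thesis.

```python
# Sketch of a supervised fastText baseline. The training file is assumed to use
# fastText's "__label__OFF some tweet text" line format.
import fasttext

ft_model = fasttext.train_supervised(
    input="train.ft.txt",  # assumed pre-formatted training file
    epoch=10,
    lr=0.5,
    wordNgrams=2,
)
print(ft_model.predict("example tweet to classify"))  # (labels, probabilities)
```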
Related papers
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, built on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for underrepresented languages.
We show that using noisy web-crawled data instead of structured data is more suitable for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank achieves performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It demonstrates state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)