UPB at SemEval-2020 Task 12: Multilingual Offensive Language Detection
on Social Media by Fine-tuning a Variety of BERT-based Models
- URL: http://arxiv.org/abs/2010.13609v2
- Date: Tue, 27 Oct 2020 09:21:21 GMT
- Title: UPB at SemEval-2020 Task 12: Multilingual Offensive Language Detection
on Social Media by Fine-tuning a Variety of BERT-based Models
- Authors: Mircea-Adrian Tanase, Dumitru-Clementin Cercel and Costin-Gabriel
Chiru
- Abstract summary: This paper describes our Transformer-based solutions for identifying offensive language on Twitter in five languages.
These solutions were employed in Subtask A of the OffensEval 2020 shared task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offensive language detection is one of the most challenging problems in
the natural language processing field, driven by the rising presence of this
phenomenon in online social media. This paper describes our Transformer-based
solutions for identifying offensive language on Twitter in five languages
(i.e., English, Arabic, Danish, Greek, and Turkish), which were employed in
Subtask A of the OffensEval 2020 shared task. Several neural architectures
(i.e., BERT, mBERT, RoBERTa, XLM-RoBERTa, and ALBERT), pre-trained using both
single-language and multilingual corpora, were fine-tuned and compared using
multiple combinations of datasets. Finally, the highest-scoring models were
used for our submissions in the competition, which ranked our team 21st of 85,
28th of 53, 19th of 39, 16th of 37, and 10th of 46 for English, Arabic, Danish,
Greek, and Turkish, respectively.
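As a concrete illustration of the fine-tuning recipe described in the abstract, here is a minimal sketch of the standard sequence-classification setup using the Hugging Face transformers library; the checkpoint name, hyperparameters, and toy data are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch: fine-tuning a BERT-based model for binary offensive
# language detection (OFF vs. NOT), as in OffensEval 2020 Subtask A.
# Checkpoint, hyperparameters, and training data below are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-multilingual-cased"  # swap in RoBERTa, ALBERT, XLM-R, etc.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy tweets; a real run would use the OffensEval/OLID training data.
texts = ["have a great day", "you are an idiot"]
labels = torch.tensor([0, 1])  # 0 = NOT offensive, 1 = OFF

enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        # Passing labels makes the model return the cross-entropy loss directly.
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=y).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```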
Related papers
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- tmn at SemEval-2023 Task 9: Multilingual Tweet Intimacy Detection using XLM-T, Google Translate, and Ensemble Learning [2.28438857884398]
The paper describes a transformer-based system designed for SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis.
The purpose of the task was to predict the intimacy of tweets on a scale from 1 (not intimate at all) to 5 (very intimate).
arXiv Detail & Related papers (2023-04-08T15:50:16Z)
- BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German.
Based on the Transformer, we apply several effective variants.
Our systems achieve COMET scores of 0.810 and 0.946.
arXiv Detail & Related papers (2022-11-28T02:35:04Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- WOLI at SemEval-2020 Task 12: Arabic Offensive Language Identification on Different Twitter Datasets [0.0]
A key to fighting offensive language on social media is an automatic offensive language detection system.
In this paper, we describe the system submitted by WideBot AI Lab for the shared task, which ranked 10th out of 52 participants with a macro-F1 of 86.9%.
We also introduce a neural network approach that enhances our system's predictive ability, combining CNN, highway network, Bi-LSTM, and attention layers.
arXiv Detail & Related papers (2020-09-11T14:10:03Z)
- ANDES at SemEval-2020 Task 12: A jointly-trained BERT multilingual model for offensive language detection [0.6445605125467572]
We jointly trained a single model by fine-tuning Multilingual BERT to tackle the task across all of the proposed languages.
Our single model achieved competitive results, with performance close to that of the top-performing systems.
arXiv Detail & Related papers (2020-08-13T16:07:00Z)
- KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media [0.2148535041822524]
We show that combining a CNN with BERT is better than using BERT on its own; a sketch of this combination appears after this list.
We present ArabicBERT, a set of pre-trained transformer language models for Arabic.
arXiv Detail & Related papers (2020-07-26T17:26:20Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
- LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification [19.23116755449024]
We adapt and fine-tune the BERT and Multilingual BERT models made available by Google AI for the English and non-English languages, respectively.
For the English language, we use a combination of two fine-tuned BERT models.
For the other languages, we propose a cross-lingual augmentation approach to enrich the training data, and we use Multilingual BERT to obtain sentence representations.
arXiv Detail & Related papers (2020-05-07T18:45:48Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on the English Subtask A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
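The KUISAIL entry above reports that combining a CNN with BERT beats BERT alone; as referenced there, the following is a hedged sketch of one way such a BERT-CNN classifier can be wired using Hugging Face transformers plus PyTorch. The filter sizes and counts are assumptions, and the paper's exact architecture may differ.

```python
# Hedged sketch of a BERT-CNN classifier in the spirit of KUISAIL's system:
# 1D convolutions over BERT's token-level hidden states, max-pooled per
# filter size and fed to a linear layer. Filter sizes/counts are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertCNNClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels=2,
                 filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, num_filters, kernel_size=k) for k in filter_sizes
        )
        self.classifier = nn.Linear(num_filters * len(filter_sizes), num_labels)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        states = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        states = states.transpose(1, 2)
        # Convolve at each filter width, ReLU, then global max-pool over time
        pooled = [torch.relu(conv(states)).max(dim=2).values
                  for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))
```

Training such a model proceeds exactly as in the earlier fine-tuning sketch, with a cross-entropy loss over the two labels.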