WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language
Identification in Code-switched YouTube Comments
- URL: http://arxiv.org/abs/2011.00559v1
- Date: Sun, 1 Nov 2020 16:52:08 GMT
- Authors: Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu
- Abstract summary: This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020.
The HASOC 2020 organizers provided participants with datasets containing code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English).
Our system achieved a weighted average F1 score of 0.89 on the test set and ranked 5th out of 12 participants.
- Score: 16.938836887702923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the WLV-RIT entry to the Hate Speech and Offensive
Content Identification in Indo-European Languages (HASOC) shared task 2020. The
HASOC 2020 organizers provided participants with annotated datasets containing
code-mixed social media posts in Dravidian languages (Malayalam-English and
Tamil-English). We participated in Task 1: offensive comment identification in
code-mixed Malayalam YouTube comments. In our methodology, we take advantage of
available English data by applying cross-lingual contextual word embeddings and
transfer learning to make predictions on Malayalam data. We further improve the
results using various fine-tuning strategies. Our system achieved a weighted
average F1 score of 0.89 on the test set and ranked 5th out of 12 participants.
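The transfer-learning recipe in the abstract, training an offensive-language classifier on plentiful English data and then fine-tuning it on a smaller Malayalam set that shares the same cross-lingual embedding space, can be sketched as follows. This is a minimal illustration only: the synthetic vectors stand in for cross-lingual contextual embeddings such as XLM-R, and the classifier, data, and hyperparameters are hypothetical rather than the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for cross-lingual sentence embeddings (e.g. XLM-R): "English"
# and "Malayalam" comments are assumed to share one vector space, so a
# classifier trained on English transfers to Malayalam.
DIM = 16
w_true = rng.normal(size=DIM)  # shared "offensiveness" direction

def make_data(n, noise=0.1):
    X = rng.normal(size=(n, DIM))
    y = (X @ w_true + noise * rng.normal(size=n) > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, w=None, lr=0.1, epochs=200):
    # Plain logistic regression trained with gradient descent.
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def accuracy(w, X, y):
    return float(((sigmoid(X @ w) > 0.5) == y).mean())

# Step 1: train on plentiful "English" data.
X_en, y_en = make_data(2000)
w = train(X_en, y_en)

# Step 2: fine-tune the same weights on a small "Malayalam" set,
# using a lower learning rate and fewer epochs.
X_ml, y_ml = make_data(100)
w_ft = train(X_ml, y_ml, w=w.copy(), lr=0.05, epochs=50)

X_test, y_test = make_data(500)
print("test accuracy:", round(accuracy(w_ft, X_test, y_test), 2))
```

The key assumption mirrored here is that the multilingual encoder already aligns the two languages, so the target-language fine-tuning step only needs a small amount of labelled data.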
Related papers
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation
System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English$\Leftrightarrow$Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z) - IIITDWD-ShankarB@ Dravidian-CodeMixi-HASOC2021: mBERT based model for
identification of offensive content in south Indian languages [0.0]
Task 1 involves identifying offensive content in Malayalam data; Task 2 covers code-mixed Malayalam and Tamil sentences.
Our team participated in Task 2.
In our suggested model, we use multilingual BERT to extract features and apply three different classifiers to the extracted features.
arXiv Detail & Related papers (2022-04-13T06:24:57Z) - CALCS 2021 Shared Task: Machine Translation for Code-Switched Data [27.28423961505655]
We address machine translation for code-switched social media data.
We create a community shared task.
For the supervised setting, participants are challenged to translate English into Hindi-English (Eng-Hinglish) in a single direction.
For the unsupervised setting, we provide the following language pairs: English and Spanish-English (Eng-Spanglish), and English and Modern Standard Arabic-Egyptian Arabic (Eng-MSAEA) in both directions.
arXiv Detail & Related papers (2022-02-19T15:39:34Z) - CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020: Identifying Offensive Language
from Manglish Tweets [0.0]
We present a working model submitted for Task 2 of the sub-track HASOC Offensive Language Identification - DravidianCodeMix.
It is a message level classification task.
An embedding model-based classifier identifies offensive and not offensive comments in our approach.
arXiv Detail & Related papers (2020-10-17T10:11:41Z) - BRUMS at SemEval-2020 Task 12 : Transformer based Multilingual Offensive
Language Identification in Social Media [9.710464466895521]
We present a multilingual deep learning model to identify offensive language in social media.
The approach achieves acceptable evaluation scores, while maintaining flexibility between languages.
arXiv Detail & Related papers (2020-10-13T10:39:14Z) - Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on
Synthetically Generated Code-Mixed Data for Hate Speech Detection [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC 2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English).
The task aims to identify offensive language in a code-mixed dataset of comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2020-10-05T15:25:47Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for
Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.