CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020:Identifying Offensive Language
from ManglishTweets
- URL: http://arxiv.org/abs/2010.08756v1
- Date: Sat, 17 Oct 2020 10:11:41 GMT
- Title: CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020:Identifying Offensive Language
from ManglishTweets
- Authors: Sara Renjit, Sumam Mary Idicula
- Abstract summary: We present a working model submitted for Task2 of the sub-track HASOC Offensive Language Identification- DravidianCodeMix.
It is a message level classification task.
An embedding model-based classifier identifies offensive and not offensive comments in our approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the popularity of social media, communications through blogs, Facebook,
Twitter, and other plat-forms have increased. Initially, English was the only
medium of communication. Fortunately, now we can communicate in any language.
It has led to people using English and their own native or mother tongue
language in a mixed form. Sometimes, comments in other languages have English
transliterated format or other cases; people use the intended language scripts.
Identifying sentiments and offensive content from such code mixed tweets is a
necessary task in these times. We present a working model submitted for Task2
of the sub-track HASOC Offensive Language Identification- DravidianCodeMix in
Forum for Information Retrieval Evaluation, 2020. It is a message level
classification task. An embedding model-based classifier identifies offensive
and not offensive comments in our approach. We applied this method in the
Manglish dataset provided along with the sub-track.
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - BERTuit: Understanding Spanish language in Twitter through a native
transformer [70.77033762320572]
We present bfBERTuit, the larger transformer proposed so far for Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z) - Offense Detection in Dravidian Languages using Code-Mixing Index based
Focal Loss [1.7267596343997798]
Complexity of identifying offensive content is exacerbated by the usage of multiple modalities.
Our model can handle offensive language detection in a low-resource, class imbalanced, multilingual and code mixed setting.
arXiv Detail & Related papers (2021-11-12T19:50:24Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for
Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z) - Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media [1.8899300124593648]
Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of unstructured code-mix Hinglish language.
arXiv Detail & Related papers (2021-05-11T10:02:28Z) - Unsupervised Transfer Learning in Multilingual Neural Machine
Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a basesystem containing several Romance languages we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
arXiv Detail & Related papers (2021-03-11T14:22:08Z) - Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on
Synthetically Generated Code-Mixed Data for Hate Speech Detection [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC 2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English)
The task aims to identify offensive language in code-mixed dataset of comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2020-10-05T15:25:47Z) - "Hinglish" Language -- Modeling a Messy Code-Mixed Language [0.0]
This project focuses on using deep learning techniques to tackle a classification problem in categorizing social content written in Hindi-English into Abusive, Hate-Inducing and Not offensive categories.
We utilize bi-directional sequence models with easy text augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion.
arXiv Detail & Related papers (2019-12-30T23:01:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.