NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from
Multilingual Code-Mixed Text using Transformers
- URL: http://arxiv.org/abs/2103.00455v1
- Date: Sun, 28 Feb 2021 11:10:32 GMT
- Title: NLP-CUET@DravidianLangTech-EACL2021: Offensive Language Detection from
Multilingual Code-Mixed Text using Transformers
- Authors: Omar Sharif, Eftekhar Hossain, Mohammed Moshiul Hoque
- Abstract summary: This paper presents an automated system that can identify offensive text from multilingual code-mixed data.
Datasets were provided in three languages: Tamil, Malayalam and Kannada code-mixed with English.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The increasing accessibility of the internet has facilitated
social media usage and encouraged individuals to express their opinions
freely. However, it has also created a space for content polluters to
disseminate offensive posts. Most such offensive posts are written in a
cross-lingual manner and can easily evade online surveillance systems. This
paper presents an automated system that identifies offensive text in
multilingual code-mixed data. In the shared task, datasets were provided in
three languages, Tamil, Malayalam and Kannada, each code-mixed with English,
and participants were asked to build a separate model for each language. To
accomplish the task, we employed two machine learning techniques (LR, SVM),
two deep learning techniques (LSTM, LSTM+Attention) and three
transformer-based methods (m-BERT, Indic-BERT, XLM-R). Results show that
XLM-R outperforms the other techniques for Tamil and Malayalam, while m-BERT
achieves the highest score for Kannada. The proposed models obtained weighted
$f_1$ scores of $0.76$ (Tamil), $0.93$ (Malayalam), and $0.71$ (Kannada),
ranking $3^{rd}$, $5^{th}$ and $4^{th}$, respectively.
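A minimal sketch of the pipeline the abstract describes, assuming the Hugging Face Transformers stack: fine-tune XLM-R on one language's data and score it with the weighted $f_1$ metric used for ranking. File names, the label count, and hyperparameters are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "xlm-roberta-base"  # exact XLM-R variant assumed
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=6)

# Hypothetical CSVs with "text" and "label" columns, one pair per language.
data = load_dataset("csv", data_files={"train": "tamil_train.csv",
                                       "dev": "tamil_dev.csv"})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length", max_length=128),
                batched=True)

def weighted_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Systems in the shared task are ranked by weighted F1.
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["dev"],
    compute_metrics=weighted_f1,
)
trainer.train()
print(trainer.evaluate())
```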
Related papers
- How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs first understand the query, converting multilingual inputs into English for task solving.
In the intermediate layers, they think in English and incorporate multilingual knowledge through self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
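As a concrete illustration, here is a hypothetical prompt-template sketch of the kind such studies vary; the wording and language pairs are illustrative only and not taken from the paper, which reports that results depend on both.

```python
# Hypothetical prompt template for eliciting code-mixed text from an LLM.
SEA_PAIRS = [("Malay", "English"), ("Tagalog", "English"),
             ("Vietnamese", "English")]

def code_mix_prompt(lang_a: str, lang_b: str, topic: str) -> str:
    return (f"Write a short social media comment about {topic} that mixes "
            f"{lang_a} and {lang_b} within a single sentence, as a bilingual "
            f"speaker would (code-mixing, not translation).")

for a, b in SEA_PAIRS:
    print(code_mix_prompt(a, b, "street food"))
```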
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Existing approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- bitsa_nlp@LT-EDI-ACL2022: Leveraging Pretrained Language Models for Detecting Homophobia and Transphobia in Social Media Comments [0.9981479937152642]
We present our system for the LT-EDI shared task on detecting homophobia and transphobia in social media comments.
We experiment with a number of monolingual and multilingual transformer-based models such as mBERT.
We evaluate their performance on a carefully annotated, real-life dataset of YouTube comments in English as well as Tamil.
arXiv Detail & Related papers (2022-03-27T10:15:34Z)
- IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages [0.0]
This paper makes a modest contribution to this line of research in the form of sentiment analysis of code-mixed social media comments in the popular Dravidian languages Kannada, Tamil and Malayalam.
It describes our work for the shared task conducted by Dravidian-CodeMix at FIRE 2021, employing pre-trained models such as ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset.
The best models achieved 4th, 5th and 10th rank in the Tamil, Kannada and Malayalam tasks, respectively.
arXiv Detail & Related papers (2021-11-15T16:57:59Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling [0.16252563723817934]
We classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
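Pseudo-labeling is only named above; the following is a minimal one-round sketch of the general technique, using a scikit-learn baseline and an illustrative 0.9 confidence threshold, neither of which is from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def pseudo_label(train_texts, train_labels, unlabeled_texts, threshold=0.9):
    # Train an initial model on the labeled data.
    vec = TfidfVectorizer(ngram_range=(1, 3), analyzer="char_wb")
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_texts), train_labels)

    # Label the unlabeled pool and keep only confident predictions.
    probs = clf.predict_proba(vec.transform(unlabeled_texts))
    confident = np.max(probs, axis=1) >= threshold
    new_texts = [t for t, keep in zip(unlabeled_texts, confident) if keep]
    new_labels = clf.classes_[np.argmax(probs[confident], axis=1)]

    # Retrain on the expanded set (one round; papers often iterate).
    all_texts = list(train_texts) + new_texts
    all_labels = list(train_labels) + list(new_labels)
    clf.fit(vec.fit_transform(all_texts), all_labels)
    return vec, clf
```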
arXiv Detail & Related papers (2021-08-27T08:43:08Z)
- SN Computer Science: Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts [2.0305676256390934]
This study presents extensive experiments using multiple deep learning and transfer learning models to detect offensive content on YouTube.
We propose a novel and flexible approach of selective translation and transliteration to obtain better results from fine-tuning and ensembling multilingual transformer networks.
The proposed ULMFiT and mBERT-BiLSTM models yielded good results and are promising for effective offensive speech identification in low-resourced languages.
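To make "selective transliteration" concrete, here is a minimal sketch of one plausible routing rule: comments already written mostly in Tamil script pass through unchanged, while Romanized ones are sent to a transliterator. The threshold and the `romanized_to_tamil` backend are hypothetical, not the paper's.

```python
def fraction_tamil_script(text: str) -> float:
    """Fraction of alphabetic characters in the Tamil Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    tamil = sum(1 for c in letters if "\u0b80" <= c <= "\u0bff")
    return tamil / len(letters)

def romanized_to_tamil(text: str) -> str:
    # Hypothetical backend: plug in any transliteration library here.
    raise NotImplementedError

def normalize_comment(text: str, min_native: float = 0.5) -> str:
    if fraction_tamil_script(text) >= min_native:
        return text                      # already mostly native script
    return romanized_to_tamil(text)      # Romanized: transliterate
```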
arXiv Detail & Related papers (2021-08-24T20:23:30Z)
- NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner [0.0]
We propose three models to identify hope speech in English, Tamil and Malayalam language.
Our team achieved $1^{st}$, $2^{nd}$ and $1^{st}$ rank in these three tasks, respectively.
arXiv Detail & Related papers (2021-02-28T11:30:52Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in a single language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
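A minimal sketch of the distillation step this describes, assuming a standard softened-KL objective over the branch teachers' averaged outputs (the paper's exact loss may differ); shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def branch_distillation_loss(student_logits: torch.Tensor,
                             teacher_logits: list[torch.Tensor],
                             temperature: float = 2.0) -> torch.Tensor:
    # Average the language-branch teachers' softened distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Batch-mean KL, scaled by T^2 as is conventional in distillation.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage: loss = branch_distillation_loss(student(x), [t(x) for t in teachers])
```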
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM fine-tuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language.
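A minimal sketch in FILTER's spirit, assuming the self-teaching loss is a KL term between the model's prediction on the target-language input and detached soft pseudo-labels from its translation; these details are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_teaching_loss(target_logits: torch.Tensor,
                       translation_logits: torch.Tensor) -> torch.Tensor:
    # Soft pseudo-labels: detach so no gradient flows into the teacher pass.
    pseudo = F.softmax(translation_logits.detach(), dim=-1)
    return F.kl_div(F.log_softmax(target_logits, dim=-1), pseudo,
                    reduction="batchmean")
```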
arXiv Detail & Related papers (2020-09-10T22:42:15Z)