indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language
Identification in Dravidian Languages
- URL: http://arxiv.org/abs/2102.07150v1
- Date: Sun, 14 Feb 2021 13:24:01 GMT
- Title: indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language
Identification in Dravidian Languages
- Authors: Kushal Kedia, Abhilash Nandy
- Abstract summary: The paper presents the submission of the team indicnlp@kgp to the EACL 2021 shared task "Offensive Language Identification in Dravidian Languages."
The task aimed to classify different offensive content types in 3 code-mixed Dravidian language datasets.
We achieved weighted-average F1 scores of 0.97, 0.77, and 0.72 in the Malayalam-English, Tamil-English, and Kannada-English datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper presents the submission of the team indicnlp@kgp to the EACL 2021
shared task "Offensive Language Identification in Dravidian Languages." The
task aimed to classify different offensive content types in 3 code-mixed
Dravidian language datasets. The work leverages existing state-of-the-art
approaches to text classification by incorporating additional data and transfer
learning on pre-trained models. Our final submission is an ensemble of an
AWD-LSTM-based model along with two transformer architectures based on BERT and
RoBERTa. We achieved weighted-average F1 scores of 0.97, 0.77, and 0.72 on the
Malayalam-English, Tamil-English, and Kannada-English datasets, ranking 1st,
2nd, and 3rd on the respective tasks.
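As a rough illustration of the approach described in the abstract, the sketch below shows the main ingredients: class probabilities from fine-tuned transformer checkpoints, soft voting over those probabilities, and weighted-average F1 scoring. The checkpoint paths, label set, and example texts are placeholders rather than the authors' released artifacts; the AWD-LSTM member of the ensemble would be trained separately (e.g. with fastai) and only needs to supply a probability matrix of the same shape.

```python
# Illustrative sketch of soft-voting over fine-tuned classifiers, scored with
# weighted-average F1. Checkpoint paths, labels, and texts are placeholders.
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["Not_offensive", "Offensive_Untargeted", "Offensive_Targeted"]  # placeholder label set

def predict_probs(model_name, texts, device="cpu", batch_size=32):
    """Return softmax class probabilities from one fine-tuned checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    ).to(device).eval()
    probs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              max_length=128, return_tensors="pt").to(device)
            probs.append(torch.softmax(model(**batch).logits, dim=-1).cpu().numpy())
    return np.concatenate(probs)

# Hypothetical fine-tuned checkpoints standing in for the BERT- and RoBERTa-based
# ensemble members; an AWD-LSTM model would contribute a matrix of the same shape.
texts = ["example code-mixed comment 1", "example code-mixed comment 2"]
gold = [0, 1]                                    # placeholder gold labels
checkpoints = ["path/to/finetuned-mbert", "path/to/finetuned-xlm-roberta"]

ensemble = np.mean([predict_probs(c, texts) for c in checkpoints], axis=0)
preds = ensemble.argmax(axis=1)
print("weighted F1:", f1_score(gold, preds, average="weighted"))
```

Averaging probabilities (soft voting) rather than hard votes keeps the combination well defined when the member models disagree, and a single weighted-F1 call then scores the merged predictions against the gold labels.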
Related papers
- Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - PICT@DravidianLangTech-ACL2022: Neural Machine Translation On Dravidian
Languages [1.0066310107046081]
We carried out neural machine translation for five language pairs.
The datasets for each of the five language pairs were used to train various translation models.
For some models involving monolingual corpora, we implemented backtranslation.
arXiv Detail & Related papers (2022-04-19T19:04:05Z) - PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for
Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z) - Offensive Language Identification in Low-resourced Code-mixed Dravidian
languages using Pseudo-labeling [0.16252563723817934]
We classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z) - The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking, and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z) - Comparing Approaches to Dravidian Language Identification [4.284178873394113]
This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop.
The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil.
Our results reinforce the idea that deep learning methods are not as competitive in language identification-related tasks as they are in many other text classification tasks.
arXiv Detail & Related papers (2021-03-09T16:58:55Z) - Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for
Transformer-based Offensive language Detection [5.139400587753555]
Social media often acts as a breeding ground for different forms of offensive content.
We present an exhaustive exploration of different transformer models. We also provide a genetic algorithm technique for ensembling different models; a generic sketch of this kind of ensemble-weight search appears after this list.
Our ensembled models, trained separately for each language, secured the first position in the Tamil, the second position in the Kannada, and the first position in the Malayalam sub-tasks.
arXiv Detail & Related papers (2021-02-19T18:35:38Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
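As a rough sketch of the genetic-algorithm ensembling idea mentioned in the Hate-Alert entry above (an illustration of the general technique only, not that system's actual implementation), the snippet below evolves a population of ensemble weights to maximise weighted F1 on validation predictions. The population size, mutation rate, and random stand-in probabilities are illustrative assumptions.

```python
# Toy genetic-algorithm search over ensemble weights, with weighted F1 as the
# fitness function. All hyperparameters and data here are illustrative.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fitness(weights, probs_per_model, gold):
    """Weighted F1 of the ensemble obtained by mixing model probabilities with `weights`."""
    weights = weights / weights.sum()
    mixed = sum(w * p for w, p in zip(weights, probs_per_model))
    return f1_score(gold, mixed.argmax(axis=1), average="weighted")

def ga_search(probs_per_model, gold, pop_size=30, generations=50, mutation=0.1):
    """Evolve ensemble weights that maximise weighted F1 on held-out predictions."""
    n_models = len(probs_per_model)
    pop = rng.random((pop_size, n_models))
    for _ in range(generations):
        scores = np.array([fitness(w, probs_per_model, gold) for w in pop])
        keep = pop[np.argsort(scores)[-(pop_size // 2):]]             # selection: best half
        children = []
        while len(keep) + len(children) < pop_size:
            a, b = keep[rng.integers(len(keep), size=2)]
            child = np.where(rng.random(n_models) < 0.5, a, b)        # uniform crossover
            child = child + mutation * rng.standard_normal(n_models)  # Gaussian mutation
            children.append(np.clip(child, 1e-6, None))               # keep weights positive
        pop = np.vstack([keep, children])
    best = max(pop, key=lambda w: fitness(w, probs_per_model, gold))
    return best / best.sum()

# Toy usage: three stand-in models, 100 validation examples, 3 classes.
probs = [rng.dirichlet(np.ones(3), size=100) for _ in range(3)]
gold = rng.integers(0, 3, size=100)
print("best ensemble weights:", ga_search(probs, gold))
```

Using the shared task metric (weighted F1) as the fitness function makes the search optimise exactly what the leaderboard measures, at the cost of possible overfitting to the validation split.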