1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs
- URL: http://arxiv.org/abs/2411.06850v1
- Date: Mon, 11 Nov 2024 10:34:36 GMT
- Title: 1-800-SHARED-TASKS @ NLU of Devanagari Script Languages: Detection of Language, Hate Speech, and Targets using LLMs
- Authors: Jebish Purbey, Siddartha Pullakhandam, Kanwal Mehreen, Muhammad Arham, Drishti Sharma, Ashay Srivastava, Ram Mohan Rao Kadiyala,
- Abstract summary: This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task.
We focus on language detection, hate speech identification, and target detection in Devanagari script languages.
- Abstract: This paper presents a detailed system description of our entry for the CHiPSAL 2025 shared task, focusing on language detection, hate speech identification, and target detection in Devanagari script languages. We experimented with a combination of large language models and their ensembles, including MuRIL, IndicBERT, and Gemma-2, and leveraged unique techniques like focal loss to address challenges in the natural understanding of Devanagari languages, such as multilingual processing and class imbalance. Our approach achieved competitive results across all tasks: F1 of 0.9980, 0.7652, and 0.6804 for Sub-tasks A, B, and C respectively. This work provides insights into the effectiveness of transformer models in tasks with domain-specific and linguistic challenges, as well as areas for potential improvement in future iterations.
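The abstract above names focal loss as the technique used to counter class imbalance. Below is a minimal sketch of a multi-class focal loss in PyTorch; the gamma value, optional class weights, and tensor shapes are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal focal-loss sketch (PyTorch); gamma, alpha, and shapes are assumptions.
from typing import Optional

import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor,
               targets: torch.Tensor,
               gamma: float = 2.0,
               alpha: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Cross-entropy scaled by (1 - p_t)^gamma so easy examples are down-weighted."""
    log_probs = F.log_softmax(logits, dim=-1)                         # (batch, num_classes)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # prob. of true class
    return ((1.0 - p_t) ** gamma * ce).mean()


# Hypothetical usage with a binary hate-speech classification head:
logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
loss = focal_loss(logits, targets, gamma=2.0)
```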
Related papers
- Hate Speech and Offensive Content Detection in Indo-Aryan Languages: A Battle of LSTM and Transformers
We conduct a comparative analysis of hate speech classification across five distinct languages: Bengali, Assamese, Bodo, Sinhala, and Gujarati.
BERT Base Multilingual Cased emerges as a strong performer across languages, achieving an F1 score of 0.67027 for Bengali and 0.70525 for Assamese.
In Sinhala, XLM-R stands out with an F1 score of 0.83493, whereas for Gujarati, a custom LSTM-based model performed best with an F1 score of 0.76601.
arXiv Detail & Related papers (2023-12-09T20:24:00Z)
- On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation
We find that failing to encode a discriminative target-language signal leads to off-target translation and a closer lexical distance.
We propose Language Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces
Hate speech and profanity detection suffer from data sparsity, especially for languages other than English.
We identify profane subspaces in word and sentence representations and explore their generalization capability.
We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations.
arXiv Detail & Related papers (2021-06-14T15:34:37Z)
- TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation
Our approach is based on pretrained transformer models and does not use any language-specific processing and resources.
Our best model achieves 0.90 accuracy on the English-English subtask, which is comparable to the best result for that subtask (0.93 accuracy).
Our approach also achieves satisfactory results on other monolingual and cross-lingual language pairs.
arXiv Detail & Related papers (2021-04-09T23:06:05Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Cross-Lingual Transfer Learning for Complex Word Identification
Complex Word Identification (CWI) is a task centered on detecting hard-to-understand words in texts.
Our approach uses zero-shot, one-shot, and few-shot learning techniques, alongside state-of-the-art solutions for Natural Language Processing (NLP) tasks.
Our aim is to provide evidence that the proposed models can learn the characteristics of complex words in a multilingual environment.
arXiv Detail & Related papers (2020-10-02T17:09:47Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
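The FILTER entry above mentions a KL-divergence self-teaching loss over auto-generated soft pseudo-labels; the snippet below is a hedged sketch of such a loss, assuming a standard temperature-scaled formulation rather than the paper's exact recipe.

```python
# Sketch of a KL self-teaching loss: pull student predictions on translated text
# toward soft pseudo-labels. Temperature and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F


def self_teaching_kl(student_logits: torch.Tensor,
                     pseudo_label_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    teacher_probs = F.softmax(pseudo_label_logits / temperature, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged per batch element
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```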
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- UPB at SemEval-2020 Task 9: Identifying Sentiment in Code-Mixed Social Media Texts using Transformers and Multi-Task Learning
We describe the systems developed by our team for SemEval-2020 Task 9.
We aim to cover two well-known code-mixed languages: Hindi-English and Spanish-English.
Our approach achieves promising performance on the Hindi-English task, with an average F1-score of 0.6850.
For the Spanish-English task, we obtained an average F1-score of 0.7064, ranking our team 17th out of 29 participants.
arXiv Detail & Related papers (2020-09-06T17:19:18Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection
We build an offensive language detection system that combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on English Sub-task A, which is comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
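The Kungfupanda entry above combines multi-task learning with BERT-based models; below is a minimal sketch of a shared-encoder, two-head setup, where the model name, label counts, and task pairing are assumptions made only for illustration.

```python
# Illustrative multi-task setup: one shared BERT encoder, two task-specific heads.
# Model name, head sizes, and the choice of tasks are assumptions for this sketch.
import torch.nn as nn
from transformers import AutoModel


class MultiTaskBert(nn.Module):
    def __init__(self, model_name: str = "bert-base-multilingual-cased",
                 num_labels_a: int = 2, num_labels_b: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)     # shared encoder
        hidden = self.encoder.config.hidden_size
        self.head_a = nn.Linear(hidden, num_labels_a)            # e.g. offensive vs. not
        self.head_b = nn.Linear(hidden, num_labels_b)            # e.g. a second sub-task

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] position of the final hidden states as a pooled representation.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head_a(cls), self.head_b(cls)
```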