Related papers: muBoost: An Effective Method for Solving Indic Multilingual Text Classification Problem

URL: http://arxiv.org/abs/2206.10280v1
Date: Tue, 21 Jun 2022 12:06:03 GMT
Title: muBoost: An Effective Method for Solving Indic Multilingual Text Classification Problem
Authors: Manish Pathak, Aditya Jain
Abstract summary: We present our solution to the Multilingual Abusive Comment Identification Problem on Moj. The problem involved detecting abusive comments in 13 regional Indic languages. We achieved a mean F1-score of 89.286 on the test data, an improvement over the baseline MURIL model with an F1-score of 87.48.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text Classification is an integral part of many Natural Language Processing tasks such as sarcasm detection, sentiment analysis and many more such applications. Many e-commerce websites and social-media/entertainment platforms use such models to enhance user experience and so generate traffic and, in turn, revenue on their platforms. In this paper, we present our solution to the Multilingual Abusive Comment Identification Problem on Moj, an Indian video-sharing social networking service powered by ShareChat. The problem involved detecting abusive comments, in 13 regional Indic languages such as Hindi, Telugu, Kannada etc., on videos on the Moj platform. Our solution utilizes the novel muBoost, an ensemble of CatBoost classifier models and the Multilingual Representations for Indian Languages (MURIL) model, to produce SOTA performance on Indic text classification tasks. We were able to achieve a mean F1-score of 89.286 on the test data, an improvement over the baseline MURIL model with an F1-score of 87.48.
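
The abstract describes muBoost as an ensemble of CatBoost classifiers and a MURIL model but does not say how the model outputs are combined. The following is a minimal sketch of one common option, weighted soft voting over per-model probabilities, followed by an F1 computation. The equal weights, the 0.5 threshold, the probability values, and the helper names `soft_vote` and `f1_score` are all illustrative assumptions, not the authors' method or data.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model probabilities for the abusive class."""
    total = sum(weights)
    n_samples = len(prob_sets[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_sets)) / total
        for i in range(n_samples)
    ]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up probabilities standing in for real model outputs (assumption).
catboost_probs = [0.9, 0.2, 0.6, 0.4]
muril_probs = [0.8, 0.1, 0.3, 0.7]
labels = [1, 0, 1, 1]

# Equal weights and a 0.5 decision threshold are arbitrary choices here.
blended = soft_vote([catboost_probs, muril_probs], weights=[0.5, 0.5])
preds = [1 if p >= 0.5 else 0 for p in blended]
score = f1_score(labels, preds)
```

In practice the weights and threshold would be tuned on held-out data, and the reported metric is a mean F1-score, so a per-group average would replace the single call above.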

Related papers

Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models [1.835004446596942]
We introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 Nepali-English code-mixed comments. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings.
arXiv Detail & Related papers (2025-04-23T11:29:10Z)
Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This paper introduces a prompt-based method for a shared task aimed at addressing word-level language identification (LI) challenges in Dravidian languages. In this work, we leveraged GPT-3.5 Turbo to understand whether large language models are able to classify words into the correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India. It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as the most important breakthroughs in natural language processing (NLP). This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources. Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
Toxicity Detection for Indic Multilingual Social Media Content [0.0]
This paper describes the system proposed by team 'Moj Masti' using the data provided by ShareChat/Moj in the IIIT-D Abusive Comment Identification challenge. We focus on how we can leverage multilingual transformer-based pre-trained and fine-tuned models to approach code-mixed/code-switched classification tasks.
arXiv Detail & Related papers (2022-01-03T12:01:47Z)
Ceasing hate with MoH: Hate Speech Detection in Hindi-English Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language. To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi. The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. We propose a methodology for efficient detection of unstructured code-mix Hinglish language.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks. However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z)
Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models. Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.