Bangla Hate Speech Classification with Fine-tuned Transformer Models
- URL: http://arxiv.org/abs/2512.02845v1
- Date: Tue, 02 Dec 2025 14:56:58 GMT
- Title: Bangla Hate Speech Classification with Fine-tuned Transformer Models
- Authors: Yalda Keivan Jafari, Krishno Dey
- Abstract summary: We study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. Alongside the official baselines, we train Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also fine-tune transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people in Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly underrepresented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and additionally train Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also fine-tune transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models except DistilBERT outperform the baseline methods on both subtasks. Among the transformer-based models, BanglaBERT produces the best performance on both subtasks. Despite being smaller, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests that language-specific pre-training is very important. Our results highlight the potential of, and need for, pre-trained language models for the low-resource Bangla language.
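Since the abstract hinges on fine-tuning pre-trained transformers for sequence classification, a minimal sketch of that pipeline with the Hugging Face Trainer follows. The BanglaBERT checkpoint name is real, but the CSV files, column names, and hyperparameters are placeholder assumptions, not the authors' actual setup.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL = "csebuetnlp/banglabert"  # swap in m-BERT or XLM-RoBERTa checkpoints to compare

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical shared-task splits as CSVs with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                batched=True)

def micro_f1(eval_pred):
    logits, labels = eval_pred
    return {"micro_f1": f1_score(labels, np.argmax(logits, axis=-1), average="micro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="banglabert-hate", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=data["train"],
    eval_dataset=data["dev"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
    compute_metrics=micro_f1,
)
trainer.train()
print(trainer.evaluate())
```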
Related papers
- LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target [27.786707138241493]
We introduce BanglaMultiHate, the first multi-task Bangla hate-speech dataset and one of the largest manually annotated corpora to date. We compare classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance.
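As an illustration of the LoRA fine-tuning this entry mentions, here is a minimal `peft` sketch. The base checkpoint, label count, and hyperparameters are assumptions for illustration (the paper applies LoRA to LLMs rather than to an encoder classifier).

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Placeholder base model and label count, chosen only for the sketch.
base = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # keep the classification head trainable
    r=8,                                # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attach adapters to attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()      # typically well under 1% of base weights
```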
arXiv Detail & Related papers (2025-10-02T13:17:11Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning helps these models learn better in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set, and our submission ranked 21st on the shared-task leaderboard.
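For reference, the micro-F1 quoted above pools all decisions across classes before computing F1; for single-label multi-class prediction it coincides with accuracy. A toy computation:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0, 2]   # toy gold labels
y_pred = [0, 1, 1, 1, 0, 2]   # toy predictions (5 of 6 correct)
print(f1_score(y_true, y_pred, average="micro"))  # 0.833...
```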
arXiv Detail & Related papers (2023-10-13T16:46:38Z)
- ML-SUPERB: Multilingual Speech Universal PERformance Benchmark [94.64616634862995]
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. This paper presents multilingual SUPERB, covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features.
arXiv Detail & Related papers (2023-05-18T00:01:27Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
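A toy single-head sketch of the core idea, under the simplifying assumption of unbatched inputs: queries are formed only for masked positions while keys and values span the full sequence, shrinking attention cost from O(T²) to O(M·T).

```python
import torch

def narrow_attention(h, masked_idx, Wq, Wk, Wv):
    # h: [T, D] hidden states; masked_idx: positions of masked tokens
    q = h[masked_idx] @ Wq                    # [M, D] queries for masked tokens only
    k = h @ Wk                                # [T, D] keys over the whole sequence
    v = h @ Wv                                # [T, D]
    scores = q @ k.T / k.shape[-1] ** 0.5     # [M, T]
    attn = torch.softmax(scores, dim=-1)
    return attn @ v                           # [M, D] updated masked-token states

T, D = 16, 32
h = torch.randn(T, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
masked_idx = torch.tensor([1, 5, 9, 12])
print(narrow_attention(h, masked_idx, Wq, Wk, Wv).shape)  # torch.Size([4, 32])
```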
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models [0.0]
We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
We produce a robust Bangla ASR model that outperforms existing ones.
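One common way to realize such an n-gram post-processor is beam-search decoding of CTC logits against a KenLM model, e.g. with pyctcdecode. The vocabulary, file path, and weights below are placeholders for illustration, not the paper's actual artifacts.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = ["", "a", "b", "c", " "]           # toy CTC vocabulary ("" assumed as blank)
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="bangla_5gram.arpa",  # hypothetical n-gram LM file
    alpha=0.5,                             # LM weight during beam search
    beta=1.0,                              # word-insertion bonus
)
logits = np.random.randn(50, len(vocab))   # stand-in for acoustic model output [T, V]
print(decoder.decode(logits))
```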
arXiv Detail & Related papers (2022-09-13T17:59:21Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over $1.4$ million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
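The threshold testing described above can be pictured as filtering SOLID instances by their aggregate offensiveness score; the field names below are illustrative, not SOLID's actual schema.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["...", "...", "..."],
    "avg_score": [0.92, 0.41, 0.78],   # semi-supervised confidence per instance
})

for threshold in (0.5, 0.7, 0.9):      # candidate cutoffs to compare
    selected = df[df["avg_score"] >= threshold]
    print(threshold, len(selected))    # retraining-set size at each cutoff
```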
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Neural Models for Offensive Language Detection [0.0]
Offensive language detection is an ever-growing natural language processing (NLP) application.
We believe that improving and comparing different machine learning models to fight such harmful content is an important and challenging goal for this thesis.
arXiv Detail & Related papers (2021-05-30T13:02:45Z)
- BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB of data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
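The distillation objective can be sketched as a KL divergence between the teacher's and student's last-layer self-attention distributions (the paper treats value relations analogously); the shapes below are toy values for illustration.

```python
import torch
import torch.nn.functional as F

B, H, T = 2, 4, 8                      # batch, attention heads, sequence length
teacher_attn = torch.softmax(torch.randn(B, H, T, T), dim=-1)
student_attn = torch.softmax(torch.randn(B, H, T, T), dim=-1)

# KL(teacher || student), averaged over batch, heads, and query positions.
loss = F.kl_div(student_attn.log(), teacher_attn, reduction="batchmean") / (H * T)
print(loss.item())
```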
arXiv Detail & Related papers (2020-02-25T15:21:10Z)