Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
- URL: http://arxiv.org/abs/2510.03683v2
- Date: Fri, 10 Oct 2025 05:42:06 GMT
- Title: Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
- Authors: Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Zain, Momina Hafeez, Grigori Sidorov
- Abstract summary: We propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translate the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs. We fine-tune several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa.
- Score: 5.908448629364552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of derogatory terms in languages that employ code-mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code-mixing features. Our focus is on classification performance using English-translated, low-resource inputs. We fine-tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs. non-offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine-tuning high-performing models for low-resource settings such as code-mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive language detection systems based on LLMs.
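To make the adaptation recipe concrete, the sketch below shows what a QLoRA fine-tuning run for binary offensive vs. non-offensive classification could look like with Hugging Face Transformers and PEFT. The checkpoint id, file names, column names, and hyperparameters are illustrative assumptions, not the authors' reported configuration; the English-translated texts are assumed to have been produced upstream (e.g. via Google Translate) before being written to the CSV files.

```python
# Minimal QLoRA fine-tuning sketch for offensive vs. non-offensive classification.
# All names and hyperparameters below are illustrative assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint id

# 4-bit NF4 quantization of the frozen base weights is the core of QLoRA's memory savings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, quantization_config=bnb_config, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapters are trained; the quantized base model stays frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)

# Hypothetical CSVs with a "text" column (English-translated post) and a "label" column (0/1).
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qlora-offensive", per_device_train_batch_size=4,
        gradient_accumulation_steps=4, num_train_epochs=3,
        learning_rate=2e-4, bf16=True, logging_steps=50,
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```

Swapping `model_name` for Mistral 7B v0.1 or LLaMA 2 7B would follow the same pattern; encoder baselines such as RoBERTa or ModernBERT can skip the quantization and adapter steps and be fine-tuned directly.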
Related papers
- It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models [1.6407393639625105]
MHEL-LLaMo is an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. We evaluate MHEL-LLaMo on four established benchmarks in six European languages. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning.
arXiv Detail & Related papers (2026-01-13T12:36:38Z) - Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language [1.4206084598312039]
As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. We introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature and validated by native speakers. Experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma.
arXiv Detail & Related papers (2025-10-10T06:07:14Z) - Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages [4.702593857707973]
Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam.
arXiv Detail & Related papers (2025-08-12T17:17:13Z) - Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models [0.6554326244334868]
Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored. We propose a transformer-based approach using the m2m100 multilingual translation model. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu->Roman-Urdu and 97.44 for Roman-Urdu->Urdu.
arXiv Detail & Related papers (2025-03-27T14:18:50Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT, and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - Why Not Transform Chat Large Language Models to Non-English? [60.200490339844634]
The scarcity of non-English data limits the development of non-English large language models (LLMs). TransLLM divides the transfer problem into common sub-tasks with a translation chain-of-thought. Our method, using only single-turn data, outperforms strong baselines and ChatGPT on the multi-turn benchmark MT-bench.
arXiv Detail & Related papers (2024-05-22T18:53:25Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT)
We present a novel method, CoD, which augments LLMs with prior knowledge with the chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z) - Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z)