Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
- URL: http://arxiv.org/abs/2504.02178v1
- Date: Wed, 02 Apr 2025 23:46:49 GMT
- Title: Subasa -- Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
- Authors: Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, S. R. Liyanage
- Abstract summary: We introduce four models: "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction, along with two variants of "Subasa-Llama" and "Subasa-Mistral". We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection.
- Score: 9.298909305675595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance on this task between low- and high-resource languages. In this paper, we adapt fine-tuning strategies that have not previously been explored for Sinhala to the downstream task of offensive language detection. Using this approach, we introduce four models: "Subasa-XLM-R", which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction, together with two variants of "Subasa-Llama" and "Subasa-Mistral", fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, trained with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
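At its core, Subasa-XLM-R is an XLM-R classifier fine-tuned on SOLD after the intermediate pre-finetuning step. Below is a minimal sketch of the downstream classification stage only, assuming SOLD is available on the Hugging Face Hub with "text" and "label" (NOT/OFF) fields; the checkpoint id, dataset id, field names, and hyperparameters are illustrative assumptions, and the Masked Rationale Prediction step is omitted.

```python
# Minimal sketch: fine-tune an XLM-R backbone for binary offensive language
# detection on SOLD. Hub ids and hyperparameters are assumptions, not the
# authors' released configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BACKBONE = "xlm-roberta-large"               # assumed backbone checkpoint
dataset = load_dataset("sinhala-nlp/SOLD")   # assumed Hub id for SOLD

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSequenceClassification.from_pretrained(BACKBONE, num_labels=2)

label2id = {"NOT": 0, "OFF": 1}

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["labels"] = [label2id[label] for label in batch["label"]]
    return enc

encoded = dataset.map(preprocess, batched=True,
                      remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="subasa-xlmr", learning_rate=2e-5,
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```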
Related papers
- Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset [0.5530212768657544]
We use the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. Our results show that the non-occlusion models perform better than the occlusion-based models when measured by validation loss and perplexity.
arXiv Detail & Related papers (2025-01-25T17:25:06Z)
- Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning [0.4194295877935868]
This study investigates the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi. Using a translated dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation performance declines post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts (a minimal LoRA configuration sketch follows this entry).
arXiv Detail & Related papers (2024-11-27T18:14:38Z)
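As context for the LoRA PEFT recipe above, here is a minimal configuration sketch using Hugging Face's `peft` library; the Gemma checkpoint id, rank, scaling factor, and target modules are illustrative assumptions rather than the paper's reported setup.

```python
# Illustrative LoRA PEFT setup for a Gemma-family causal LM. All values
# below are assumptions for demonstration, not the paper's exact recipe.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=16,              # low-rank dimension
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full model
```

Only the injected low-rank matrices receive gradients, which is what keeps instruction-tuning on a translated low-resource dataset affordable.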
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). In low-resource language ASR, however, they encounter a domain mismatch between the pre-trained and low-resource languages. We extend a conventional adapter-based efficient fine-tuning scheme to handle these issues (a generic adapter sketch follows this entry).
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
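The adapter-based scheme mentioned above inserts small trainable modules into a frozen pre-trained backbone. Below is a generic bottleneck adapter sketched in PyTorch; the dimensions are illustrative and this is not the paper's exact architecture.

```python
# Generic bottleneck adapter: a small down-/up-projection with a residual
# connection, trained while the pre-trained SSL backbone stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))

# Only the adapter parameters are updated during fine-tuning.
adapter = Adapter()
hidden_states = torch.randn(2, 100, 768)  # (batch, frames, hidden)
out = adapter(hidden_states)
```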
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models (a generation sketch follows this entry).
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
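To make the conditional-generation framing above concrete, here is a hedged sketch in which a small seq2seq model learns to emit a relation path for a question, which would then be expanded into a subgraph against the KG; the prompt and path format are invented for illustration, and `t5-base` is chosen only because it matches the ~220M-parameter scale mentioned above.

```python
# Hedged sketch: subgraph retrieval as conditional generation. The path
# serialization and prompt prefix are assumptions for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # ~220M parameters
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

question = "Who directed the film that won Best Picture in 1998?"
target = "award.winner -> film.directed_by"           # assumed path format

enc = tokenizer("retrieve path: " + question, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**enc, labels=labels).loss               # teacher-forced training
loss.backward()
```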
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Data-Augmentation-Based Dialectal Adaptation for LLMs [26.72394783468532]
This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024.
The task focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects.
We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance.
arXiv Detail & Related papers (2024-04-11T19:15:32Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges including out-of-vocabulary words, domain mismatches, and a lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into a single sequence generation task, using a generative language model with unidirectional attention (see the sketch after this entry).
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
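A hedged sketch of the generative reformulation above: aspect terms and polarities are serialized into a flat target string, and a decoder-only (unidirectional-attention) LM is trained on the prompt plus target; the template and the GPT-2 checkpoint are illustrative assumptions, not the paper's exact format.

```python
# Sketch: aspect-based sentiment analysis as sequence generation with a
# unidirectional LM. Template and serialization are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

review = "The battery life is great but the screen is dim."
target = "battery life : positive ; screen : negative"
prompt = f"Review: {review}\nAspects:"

# Teacher-forced training on prompt + target; at inference, generate the
# aspect/polarity string and parse it back into (term, polarity) pairs.
enc = tokenizer(prompt + " " + target + tokenizer.eos_token, return_tensors="pt")
loss = model(**enc, labels=enc["input_ids"]).loss
loss.backward()
```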