Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
- URL: http://arxiv.org/abs/2407.01374v1
- Date: Mon, 1 Jul 2024 15:26:03 GMT
- Title: Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
- Authors: Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam
- Abstract summary: Malaysian English is a low-resource creole language.
Named Entity Recognition models underperform when capturing entities from Malaysian English text.
We introduce MENmBERT and MENBERT, pre-trained language models with contextual understanding tailored for Malaysian English.
- Score: 1.8241632171540025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Malaysian English is a low-resource creole language that carries elements of Malay, Chinese, and Tamil, in addition to Standard English. Named Entity Recognition (NER) models underperform when capturing entities from Malaysian English text due to its distinctive morphosyntactic adaptations, semantic features, and code-switching (mixing English and Malay). Considering these gaps, we introduce MENmBERT and MENBERT, pre-trained language models with contextual understanding specifically tailored for Malaysian English. We fine-tuned MENmBERT and MENBERT using manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning process allows the PLMs to learn representations that capture the nuances of Malaysian English relevant to NER and RE tasks. MENmBERT achieved a 1.52% improvement on NER and a 26.27% improvement on RE compared to the bert-base-multilingual-cased model. Although the overall NER performance does not improve significantly, our further analysis shows a significant improvement when results are evaluated across the 12 entity labels. These findings suggest that pre-training language models on language-specific and geographically focused corpora can be a promising approach for improving NER performance in low-resource settings. The dataset and code published in this paper provide valuable resources for NLP research on Malaysian English.
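As an illustration of the fine-tuning setup the abstract describes, the following is a minimal sketch using the Hugging Face transformers API; the label subset, output directory, and hyperparameters are illustrative assumptions, not the authors' exact pipeline.
```python
# Minimal sketch: fine-tuning bert-base-multilingual-cased for NER with the
# Hugging Face transformers API. The label subset below is a placeholder;
# the MEN dataset defines its own 12 entity labels.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # illustrative subset
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

args = TrainingArguments(output_dir="men-ner", num_train_epochs=3,
                         learning_rate=5e-5, per_device_train_batch_size=16)

# train_ds / eval_ds would hold tokenized MEN sentences with word-aligned
# label ids; their preparation is omitted here.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, tokenizer=tokenizer)
# trainer.train()
```
In practice, word-piece tokenization requires aligning word-level entity labels to subword tokens before training; that alignment step, omitted above, is where most NER fine-tuning pipelines differ.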
Related papers
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge the coverage gap by supporting a comprehensive range of languages spoken in the region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper highlights the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction [1.9927672677487354]
This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset.
We develop a dataset with 6,061 entities and 3,268 relation instances.
This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English.
arXiv Detail & Related papers (2024-02-22T13:12:05Z)
- MaLLaM -- Malaysia Large Language Model [0.0]
We trained models with 1.1 billion, 3 billion, and 5 billion parameters on a substantial 349GB dataset.
MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language.
arXiv Detail & Related papers (2024-01-26T06:56:05Z)
- SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z)
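As a side note on the "extended vocabulary" step the SeaLLMs entry mentions, here is a hedged sketch of how a tokenizer and embedding matrix are typically grown before continued pre-training; the checkpoint name and added tokens are illustrative assumptions (Llama-2 weights are gated on the Hugging Face Hub).
```python
# Sketch of vocabulary extension before continued pre-training, using the
# Hugging Face transformers API. Checkpoint and tokens are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add tokens the base tokenizer splits poorly, then grow the embedding
# matrix to match before resuming pre-training on regional text.
new_tokens = ["makan", "boleh", "kampung"]  # illustrative Malay words
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```
The newly added rows of the embedding matrix are randomly initialized, which is why continued pre-training on in-language text is needed before the extended vocabulary helps downstream tasks.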
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
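To make the continual pre-training idea above concrete, here is a minimal sketch of denoising continued pre-training on mBART, assuming the Hugging Face transformers API; the one-sentence Malay "corpus" and simplified span-masking noise are stand-ins for mBART's full text-infilling objective.
```python
# Minimal sketch: continued denoising pre-training on mBART for a language
# unseen in its original pre-training (Malay is not in mbart-large-cc25).
import torch
from transformers import MBartForConditionalGeneration, MBartTokenizer

name = "facebook/mbart-large-cc25"
tokenizer = MBartTokenizer.from_pretrained(name)
model = MBartForConditionalGeneration.from_pretrained(name)

def infill(text, ratio=0.35):
    """Replace one contiguous span of words with <mask> (simplified noise)."""
    words = text.split()
    span = max(1, int(len(words) * ratio))
    start = torch.randint(0, max(1, len(words) - span), (1,)).item()
    return " ".join(words[:start] + [tokenizer.mask_token] + words[start + span:])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
for sentence in ["Kerajaan mengumumkan dasar baharu semalam."]:  # toy corpus
    batch = tokenizer(infill(sentence), return_tensors="pt")
    labels = tokenizer(text_target=sentence, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # reconstruct the clean text
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```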
- IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP [41.57622648924415]
The Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world.
Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization.
We release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM.
arXiv Detail & Related papers (2020-11-02T01:54:56Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that using NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)