Related papers: Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

URL: http://arxiv.org/abs/2402.14521v1
Date: Thu, 22 Feb 2024 13:12:05 GMT
Title: Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction
Authors: Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam
Abstract summary: This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset. We develop a dataset with 6,061 entities and 3,268 relation instances. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English.
Score: 1.9927672677487354
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. Unfortunately, most existing datasets are based on standard English and are therefore inadequate for improving NLP tasks in Malaysian English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian English news articles highlights that they cannot handle morphosyntactic variations in Malaysian English. To the best of our knowledge, there is no annotated dataset available to improve the model. To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English could significantly improve NER performance on Malaysian English. This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset. To validate the quality of the annotation, inter-annotator agreement was used, followed by adjudication of disagreements by a subject matter expert. Upon completion of these tasks, we developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze the NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction. The dataset and annotation guidelines have been published on GitHub.

Related papers

HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew Machine Reading dataset as extractive Questioning. The morphologically rich nature of Hebrew poses a challenge to this endeavor. We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z)
Bridging the Gap: Transfer Learning from English PLMs to Malaysian English [1.8241632171540025]
Malaysian English is a low-resource creole language. Named Entity Recognition models underperform when capturing entities from Malaysian English text. We introduce MENmBERT and MENBERT, pre-trained language models with contextual understanding.
arXiv Detail & Related papers (2024-07-01T15:26:03Z)
Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet. On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA). We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model. The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
IndoNLI: A Natural Language Inference Dataset for Indonesian [4.707529518839985]
IndoNLI is the first human-elicited NLI dataset for Indonesian. We collect nearly 18K sentence pairs annotated by crowd workers and experts.
arXiv Detail & Related papers (2021-10-27T16:37:13Z)
An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens). An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
Multilingual Argument Mining: Datasets and Analysis [9.117984896907782]
We explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages. We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments. We provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translation of the English datasets.
arXiv Detail & Related papers (2020-10-13T14:49:10Z)
GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations. GCNs struggle to model words that have long-range dependencies or that are not directly connected in the dependency tree. We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.