Malaysian English News Decoded: A Linguistic Resource for Named Entity
and Relation Extraction
- URL: http://arxiv.org/abs/2402.14521v1
- Date: Thu, 22 Feb 2024 13:12:05 GMT
- Title: Malaysian English News Decoded: A Linguistic Resource for Named Entity
and Relation Extraction
- Authors: Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam
- Abstract summary: This paper presents our effort in the data acquisition, annotation methodology, and thorough analysis of the annotated dataset.
We develop a dataset with 6,061 entities and 3,268 relation instances.
This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English.
- Score: 1.9927672677487354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard English and Malaysian English exhibit notable differences, posing
challenges for natural language processing (NLP) tasks on Malaysian English.
Unfortunately, most of the existing datasets are mainly based on standard
English and therefore inadequate for improving NLP tasks in Malaysian English.
An experiment using state-of-the-art Named Entity Recognition (NER) solutions
on Malaysian English news articles highlights that they cannot handle
morphosyntactic variations in Malaysian English. To the best of our knowledge,
there is no annotated dataset available to improvise the model. To address
these issues, we constructed a Malaysian English News (MEN) dataset, which
contains 200 news articles that are manually annotated with entities and
relations. We then fine-tuned the spaCy NER tool and validated that having a
dataset tailor-made for Malaysian English could improve the performance of NER
in Malaysian English significantly. This paper presents our effort in the data
acquisition, annotation methodology, and thorough analysis of the annotated
dataset. To validate the quality of the annotation, inter-annotator agreement
was used, followed by adjudication of disagreements by a subject matter expert.
Upon completion of these tasks, we managed to develop a dataset with 6,061
entities and 3,268 relation instances. Finally, we discuss on spaCy fine-tuning
setup and analysis on the NER performance. This unique dataset will contribute
significantly to the advancement of NLP research in Malaysian English, allowing
researchers to accelerate their progress, particularly in NER and relation
extraction. The dataset and annotation guideline has been published on Github.
Related papers
- Bridging the Gap: Transfer Learning from English PLMs to Malaysian English [1.8241632171540025]
Malaysian English is a low resource creole language.
Named Entity Recognition models underperform when capturing entities from Malaysian English text.
We introduce MENmBERT and MENBERT, a pre-trained language model with contextual understanding.
arXiv Detail & Related papers (2024-07-01T15:26:03Z) - Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - The Effect of Normalization for Bi-directional Amharic-English Neural
Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z) - IndoNLI: A Natural Language Inference Dataset for Indonesian [4.707529518839985]
IndoNLI is the first human-elicited NLI dataset for Indonesian.
We collect nearly 18K sentence pairs annotated by crowd workers and experts.
arXiv Detail & Related papers (2021-10-27T16:37:13Z) - An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens)
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Multilingual Argument Mining: Datasets and Analysis [9.117984896907782]
We explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages.
We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments.
We provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translation of the English datasets.
arXiv Detail & Related papers (2020-10-13T14:49:10Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.