Limpeh ga li gong: Challenges in Singlish Annotations
- URL: http://arxiv.org/abs/2410.16156v1
- Date: Mon, 21 Oct 2024 16:21:45 GMT
- Title: Limpeh ga li gong: Challenges in Singlish Annotations
- Authors: Lynnette Hui Xian Ng, Luo Qi Chan
- Abstract summary: We work on a fundamental Natural Language Processing task: Parts-Of-Speech (POS) tagging of Singlish sentences.
For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers.
Experiments show that automatic transition- and transformer-based taggers perform with only $\sim 80\%$ accuracy when evaluated against human-annotated POS labels.
- Score: 1.3812010983144802
- License:
- Abstract: Singlish, or Colloquial Singapore English, is a language formed from oral and social communication within multicultural Singapore. In this work, we address a fundamental Natural Language Processing (NLP) task: Parts-Of-Speech (POS) tagging of Singlish sentences. For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers. Our experiments show that automatic transition- and transformer-based taggers perform with only $\sim 80\%$ accuracy when evaluated against human-annotated POS labels, suggesting that there is indeed room for improvement in computational analysis of the language. We provide an exposition of the challenges in Singlish annotation: its inconsistencies in form and semantics, the highly context-dependent particles of the language, its structurally unique expressions, and the variation of the language across different mediums. Our task definition, resultant labels, and results reflect the challenges of analysing colloquial languages formulated from a variety of dialects, and pave the way for future studies beyond POS tagging.
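As a rough illustration of the evaluation setup described in the abstract, the sketch below tags a made-up Singlish sentence with an off-the-shelf English spaCy pipeline and scores it against hypothetical human-annotated gold tags. The sentence, the gold labels, and the choice of spaCy are illustrative assumptions, not the paper's dataset or taggers.

```python
# Toy illustration (not the authors' pipeline): tag a made-up Singlish
# sentence with an off-the-shelf English tagger and score it against
# hypothetical human-annotated gold UPOS tags.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

sentence = "He damn tired already lah"              # illustrative Singlish example
gold_tags = ["PRON", "ADV", "ADJ", "ADV", "PART"]   # hypothetical gold annotation

doc = nlp(sentence)
pred_tags = [token.pos_ for token in doc]

correct = sum(p == g for p, g in zip(pred_tags, gold_tags))
accuracy = correct / len(gold_tags)
print(list(zip([t.text for t in doc], pred_tags, gold_tags)))
print(f"token-level accuracy: {accuracy:.2%}")
```

Running this on real Singlish data would make the context-dependent particles (e.g. "lah") the obvious failure points, which is consistent with the gap the paper reports.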
Related papers
- Disentangling Singlish Discourse Particles with Task-Driven Representation [1.3812010983144802]
Singlish, or formally Colloquial Singapore English, is an English-based creole language originating from the Southeast Asian country of Singapore.
A fundamental task to understanding Singlish is to first understand the pragmatic functions of its discourse particles.
This work offers a preliminary effort to disentangle the Singlish discourse particles with task-driven representation learning.
After disentanglement, we cluster these discourse particles to differentiate their pragmatic functions, and perform Singlish-to-English machine translation.
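A minimal sketch of the clustering step only, assuming random placeholder vectors in place of the learned particle representations; the particle list and the use of scikit-learn KMeans are illustrative choices, not the paper's method.

```python
# Minimal sketch: toy vectors stand in for learned particle representations;
# cluster them to group particles that (hypothetically) share a pragmatic function.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
particles = ["lah", "lor", "leh", "meh", "hor", "sia"]   # common Singlish particles
embeddings = rng.normal(size=(len(particles), 16))       # placeholder representations

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
for particle, label in zip(particles, labels):
    print(particle, "-> cluster", label)
```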
arXiv Detail & Related papers (2024-09-30T15:04:17Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
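A toy sketch of bitext retrieval as an alignment measure, using random placeholder embeddings instead of a real multilingual encoder; the noise level and the retrieval-by-cosine-similarity setup are assumptions for illustration only.

```python
# Toy sketch of bitext retrieval: for each source sentence embedding, retrieve
# the nearest target embedding by cosine similarity and check whether it is
# the true translation (same index).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 64
src = rng.normal(size=(n, d))              # placeholder source-language embeddings
tgt = src + 0.1 * rng.normal(size=(n, d))  # noisy "translations" for illustration

src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
sims = src_n @ tgt_n.T                     # cosine similarity matrix

retrieval_acc = (sims.argmax(axis=1) == np.arange(n)).mean()
print(f"bitext retrieval accuracy: {retrieval_acc:.2%}")
```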
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese [0.32228025627337864]
We improve word segmentation and part-of-speech tagging of the Vietnamese language by employing a simplified constituency parser.
Our neural model for joint word segmentation and part-of-speech tagging follows a syllable-based constituency architecture.
This model can be augmented with word boundaries and part-of-speech tags predicted by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)
- Verb Knowledge Injection for Multilingual Event Processing [50.27826310460763]
We investigate whether injecting explicit information on verbs' semantic-syntactic behaviour improves the performance of LM-pretrained Transformers.
We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction.
We then explore the utility of verb adapters for event extraction in other languages.
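The verb knowledge is injected through adapters; the sketch below is a generic bottleneck-adapter module in PyTorch, offered only as a rough picture of the mechanism, with hidden and bottleneck sizes chosen arbitrarily rather than taken from the paper.

```python
# Generic bottleneck adapter sketch (not the authors' exact module): a small
# residual MLP that can be inserted into a pretrained Transformer layer to
# inject extra (e.g. verb-class) knowledge without updating the base model.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

x = torch.randn(2, 10, 768)   # (batch, sequence, hidden)
print(Adapter()(x).shape)     # torch.Size([2, 10, 768])
```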
arXiv Detail & Related papers (2020-12-31T03:24:34Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
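For background, a standard fixed sinusoidal position encoding (in the style of the original Transformer), which ties each position to a deterministic vector; this is a minimal reference implementation, not the paper's cross-lingual variant.

```python
# Standard fixed sinusoidal position encoding (background only): each position
# gets a deterministic vector, so the encoding reflects monolingual word order.
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(max_len)[:, None]    # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_pe(max_len=50, d_model=16).shape)   # (50, 16)
```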
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)