Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A
Case Study in Taiwanese Hokkien
- URL: http://arxiv.org/abs/2301.08937v1
- Date: Sat, 21 Jan 2023 11:04:20 GMT
- Title: Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A
Case Study in Taiwanese Hokkien
- Authors: Sin-En Lu, Bo-Han Lu, Chao-Yi Lu, Richard Tzong-Han Tsai
- Abstract summary: In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants.
We propose a method to construct a Hokkien-Mandarin CM dataset to mitigate these limitations, overcome the morphological issues within the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method.
- Score: 5.272372029223681
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In natural language processing (NLP), code-mixing (CM) is a challenging task,
especially when the mixed languages include dialects. In Southeast Asian
countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the
most widespread code-mixed language pair among Chinese immigrants, and it is
also common in Taiwan. However, dialects such as Hokkien often suffer from a
scarcity of resources and the lack of an official writing system, which limits
the development of dialect CM research. In this paper, we propose a method to
construct a Hokkien-Mandarin CM dataset to mitigate these limitations, overcome
the morphological issues within the Sino-Tibetan language family, and offer an
efficient Hokkien word segmentation method through a linguistics-based toolkit.
Furthermore, we use our proposed dataset and employ transfer learning to train
the XLM (cross-lingual language model) for translation tasks. To fit the
code-mixing scenario, we adapt XLM slightly. We found that by using linguistic
knowledge, rules, and language tags, the model produces good results on CM data
translation while maintaining monolingual translation quality.
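Two small sketches make the abstract's pipeline concrete. First, segmentation: the paper uses a linguistics-based toolkit, but a dictionary-driven greedy longest-match pass is a common baseline for unspaced Sinitic text and illustrates the problem; the tiny lexicon below is purely hypothetical.

```python
# Greedy longest-match segmentation sketch; the lexicon is hypothetical,
# and the paper itself relies on a linguistics-based toolkit instead.
LEXICON = {"台語", "多謝", "歹勢"}  # hypothetical Hokkien lexicon entries
MAX_WORD_LEN = 4

def segment(text: str) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # Try the longest lexicon match starting at position i;
        # fall back to a single character when nothing matches.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            if length == 1 or candidate in LEXICON:
                words.append(candidate)
                i += length
                break
    return words

print(segment("台語真好"))  # ['台語', '真', '好']
```

Second, the abstract mentions adapting XLM with language tags for the code-mixing scenario. The sketch below shows one plausible way per-token language IDs could be summed into an XLM-style embedding layer; the class name, tag scheme, and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CodeMixedEmbedding(nn.Module):
    """Token + language + position embeddings, XLM-style, but with
    token-level language IDs so one sentence can mix two languages."""

    def __init__(self, vocab_size: int, n_langs: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.lang = nn.Embedding(n_langs, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids, lang_ids):
        # token_ids and lang_ids share the shape (batch, seq_len).
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.lang(lang_ids) + self.pos(positions)

# Hypothetical tag IDs: 0 = Mandarin, 1 = Hokkien.
tokens = torch.tensor([[17, 42, 7, 99]])
langs = torch.tensor([[0, 0, 1, 1]])  # a Mandarin-Hokkien code-mixed sentence
emb = CodeMixedEmbedding(vocab_size=1000, n_langs=2, max_len=128, d_model=64)
print(emb(tokens, langs).shape)  # torch.Size([1, 4, 64])
```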
Related papers
- IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages [62.60787450345489]
We explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay.
Our analysis shows that pre-training corpus bias affects the model's ability to handle Indonesian-English code-mixing better than code-mixing with the local languages.
arXiv Detail & Related papers (2023-11-21T07:50:53Z) - Simple yet Effective Code-Switching Language Identification with
Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which multilingual speakers, in casual settings, mix words from different languages within one utterance.
We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset.
Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
arXiv Detail & Related papers (2023-05-31T11:43:16Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - CoLI-Machine Learning Approaches for Code-mixed Language Identification
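For the prompting study above ("Prompting Multilingual Large Language Models to Generate Code-Mixed Texts"), the paper's exact templates are not reproduced here; the snippet below is only a hypothetical illustration of how such a zero-shot code-mixing prompt might be assembled.

```python
# Hypothetical prompt template for eliciting code-mixed text from an
# instruction-tuned multilingual model; not the template used in the paper.
TEMPLATE = (
    "Write a casual sentence that mixes {matrix_lang} and {embedded_lang}, "
    "switching between the two languages within the sentence.\n"
    "Topic: {topic}\nSentence:"
)

prompt = TEMPLATE.format(
    matrix_lang="Malay", embedded_lang="English", topic="weekend plans"
)
print(prompt)  # pass this string to the model's text-generation API
```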
- CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts [0.0]
Many Indians, especially young people, are comfortable with Hindi and English in addition to their local languages, and hence often use more than one language to post their comments on social media.
Code-mixed Kn-En texts are extracted from YouTube video comments to construct the CoLI-Kenglish dataset and code-mixed Kn-En embeddings.
The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location" and "Other".
arXiv Detail & Related papers (2022-11-17T19:16:56Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Cross-lingual Machine Reading Comprehension with Language Branch
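A minimal sketch of the translate-and-reconstruct idea in the entry above ("Learning Contextualised Cross-lingual Word Embeddings and Alignments"): one LSTM encoder feeds two decoders, and the encoder's hidden states serve as contextualised embeddings. The single-layer setup, dimensions, and teacher-forcing interface are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """One LSTM encoder with two LSTM decoders: one translates the input
    sentence, the other reconstructs it; the encoder's per-token hidden
    states can then be reused as contextualised word embeddings."""

    def __init__(self, src_vocab: int, tgt_vocab: int, d: int = 128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)
        self.dec_trans = nn.LSTM(d, d, batch_first=True)
        self.dec_recon = nn.LSTM(d, d, batch_first=True)
        self.out_trans = nn.Linear(d, tgt_vocab)
        self.out_recon = nn.Linear(d, src_vocab)

    def forward(self, src_ids, tgt_in_ids, src_in_ids):
        # Encode once; both decoders start from the same final state.
        enc_states, (h, c) = self.encoder(self.src_emb(src_ids))
        trans, _ = self.dec_trans(self.tgt_emb(tgt_in_ids), (h, c))
        recon, _ = self.dec_recon(self.src_emb(src_in_ids), (h, c))
        # enc_states holds a contextualised embedding for each source token.
        return self.out_trans(trans), self.out_recon(recon), enc_states
```

Training would sum cross-entropy losses over both decoder outputs, so the shared encoder is pushed toward representations that support both translation and reconstruction.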
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Word Level Language Identification in English Telugu Code Mixed Data [7.538482310185133]
Intrasentential Code Switching (ICS) or Code Mixing (CM) is frequently observed nowadays.
We present a study of various models for language identification: Naïve Bayes, Random Forest, Conditional Random Field (CRF), and Hidden Markov Model (HMM).
Our best-performing system is CRF-based, with an F1-score of 0.91.
arXiv Detail & Related papers (2020-10-09T10:15:06Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
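As a rough sketch of word-level language identification as sequence labelling, in the spirit of the CRF system above ("Word Level Language Identification in English Telugu Code Mixed Data"), the snippet below uses the sklearn-crfsuite package with a few simple surface features; the feature set and toy data are illustrative, not the paper's.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def features(sent, i):
    """Simple surface features for the i-th token of a sentence."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

# Toy data: tokens tagged as English (en) or romanized Telugu (te).
sents = [["nenu", "movie", "chusanu"], ["this", "is", "chala", "bagundi"]]
tags = [["te", "en", "te"], ["en", "en", "te", "te"]]

X = [[features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)
print(crf.predict(X))  # predicted word-level language tags per sentence
```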
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Because accurate gold labels are not available for the translated text, we further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Can Multilingual Language Models Transfer to an Unseen Dialect? A Case
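A minimal sketch of a KL-divergence self-teaching term of the kind FILTER describes above: the model's predictions on translated target-language text are pulled toward auto-generated soft pseudo-labels. Tensor shapes, the temperature, and the detaching of pseudo-labels are assumptions.

```python
import torch
import torch.nn.functional as F

def self_teaching_loss(student_logits, pseudo_logits, T=1.0):
    """KL divergence from soft pseudo-labels to the student's predictions
    on translated target-language text (F.kl_div(log_p, q) computes
    KL(q || p)); pseudo-labels are detached so they act as fixed targets."""
    log_p = F.log_softmax(student_logits / T, dim=-1)
    q = F.softmax(pseudo_logits.detach() / T, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * (T * T)

student = torch.randn(8, 3, requires_grad=True)  # (batch, num_classes) logits
pseudo = torch.randn(8, 3)                       # auto-generated soft labels
loss = self_teaching_loss(student, pseudo)
loss.backward()
print(float(loss))
```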
- Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user-generated North African Arabizi as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.