Punctuation Restoration for Singaporean Spoken Languages: English,
Malay, and Mandarin
- URL: http://arxiv.org/abs/2212.05356v1
- Date: Sat, 10 Dec 2022 19:54:53 GMT
- Title: Punctuation Restoration for Singaporean Spoken Languages: English,
Malay, and Mandarin
- Authors: Abhinav Rao, Ho Thi-Nga, Chng Eng-Siong
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the work of restoring punctuation for ASR
transcripts generated by multilingual ASR systems. The focus languages are
English, Mandarin, and Malay, three of the most popular languages in Singapore.
To the best of our knowledge, this is the first system that can tackle
punctuation restoration for these three languages simultaneously. Traditional
approaches usually treat the task as a sequential labeling task; this work
instead adopts a slot-filling approach that predicts the presence and type of
punctuation marks at each word boundary. The approach is similar to the
Masked-Language Model objective employed during the pre-training stage of BERT,
but instead of predicting the masked word, our model predicts masked
punctuation. Additionally, we find that using Jieba instead of only the
built-in SentencePiece tokenizer of XLM-R can significantly improve the
performance of punctuating Mandarin transcripts. Experimental results on the
English and Mandarin IWSLT2022 datasets and Malay News show that the proposed
approach achieves state-of-the-art results for Mandarin with a 73.8% F1-score
while maintaining reasonable F1-scores for English and Malay, i.e., 74.7% and
78% respectively. Our source code, which allows reproducing the results and
building a simple web-based demonstration application, is available on GitHub.
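The slot-filling idea described in the abstract can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: the rule-based predictor below stands in for the fine-tuned XLM-R classifier, and the `<punct_mask>` token name and label set are assumptions for illustration only.

```python
# Sketch of slot-filling punctuation restoration: insert a mask slot after
# every word, then predict a punctuation label for each slot, in the spirit
# of BERT's masked-language-model objective but over punctuation marks.

PUNCT_LABELS = ["O", ",", ".", "?"]  # "O" means no punctuation at this slot

def build_slots(words):
    """Interleave words with a mask slot at every word boundary."""
    seq = []
    for w in words:
        seq.append(w)
        seq.append("<punct_mask>")  # the model fills each of these slots
    return seq

def toy_predict(words):
    """Stand-in predictor: a period at sentence end, nothing elsewhere.
    In the paper this role is played by a fine-tuned XLM-R model."""
    return ["." if i == len(words) - 1 else "O" for i in range(len(words))]

def restore(words, labels):
    """Attach each predicted mark to the word preceding its slot."""
    return " ".join(
        w + ("" if lab == "O" else lab) for w, lab in zip(words, labels)
    )

words = "how are you today".split()
print(build_slots(words))
print(restore(words, toy_predict(words)))  # how are you today.
```

For Mandarin, the paper's finding suggests segmenting the transcript into words with Jieba first, so that the mask slots fall on linguistically meaningful word boundaries rather than on XLM-R's SentencePiece subword boundaries.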
Related papers
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - Bridging the Gap between Language Models and Cross-Lingual Sequence
Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel
arXiv Detail & Related papers (2022-04-11T15:55:20Z) - NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual
Prediction of Human Reading Behavior in Universal Language Space [0.0]
The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA)
A finetuned Random Forest model obtained best performance for both tasks with 3.8031 and 3.9065 MAE scores for mean first fixation duration (FFDAve) and mean total reading time (TRTAve) respectively.
arXiv Detail & Related papers (2022-02-22T12:39:16Z) - Multilingual AMR Parsing with Noisy Knowledge Distillation [68.01173640691094]
We study multilingual AMR parsing from the perspective of knowledge distillation, where the aim is to learn and improve a multilingual AMR parser by using an existing English parser as its teacher.
We identify that noisy input and precise output are the key to successful distillation.
arXiv Detail & Related papers (2021-09-30T15:13:48Z) - Simple and Effective Zero-shot Cross-lingual Phoneme Recognition [46.76787843369816]
This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages.
Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures.
arXiv Detail & Related papers (2021-09-23T22:50:32Z) - The Effectiveness of Intermediate-Task Training for Code-Switched
Natural Language Understanding [15.54831836850549]
We propose the use of bilingual intermediate pretraining as a reliable technique to derive performance gains on three different NLP tasks using code-switched text.
We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99%, on the mean accuracies and F1 scores over previous state-of-the-art systems.
We show consistent performance gains on four different code-switched language-pairs (Hindi-English, Spanish-English, Tamil-English and Malayalam-English) for SA.
arXiv Detail & Related papers (2021-07-21T08:10:59Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose two kinds of tokenizers: 1) SHUOWEN (meaning "Talk Word"), the pronunciation-based tokenizers; 2) JIEZI (meaning "Solve Character"), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural
Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - Cross-Lingual Transfer Learning for Complex Word Identification [0.3437656066916039]
Complex Word Identification (CWI) is a task centered on detecting hard-to-understand words in texts.
Our approach uses zero-shot, one-shot, and few-shot learning techniques, alongside state-of-the-art solutions for Natural Language Processing (NLP) tasks.
Our aim is to provide evidence that the proposed models can learn the characteristics of complex words in a multilingual environment.
arXiv Detail & Related papers (2020-10-02T17:09:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.