Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
- URL: http://arxiv.org/abs/2312.06668v1
- Date: Wed, 6 Dec 2023 01:32:20 GMT
- Title: Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus
- Authors: Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi,
Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang,
Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi
- Abstract summary: Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan.
To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set.
- Score: 12.780273009783102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taiwanese Hokkien is declining in use and status due to a language shift
towards Mandarin in Taiwan. This is partly why it is a low resource language in
NLP and speech research today. To ensure that the state of the art in speech
processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour
dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's
suite of self-supervised learning (SSL) speech representations on our dataset,
we find that model size does not consistently determine performance. In fact,
certain smaller models outperform larger ones. Furthermore, linguistic
alignment between pretraining data and the target language plays a crucial
role.
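
ML-SUPERB-style evaluations generally keep the SSL encoder frozen and train only a lightweight downstream head on the target-language data, so scores mostly reflect the quality of the pretrained representations rather than the probe's capacity. The sketch below is a rough illustration of that general recipe, not the authors' actual ML-SUPERB configuration: the HuggingFace wav2vec 2.0 checkpoint, the CTC phone-recognition probe, and the label-set size are all illustrative assumptions.

```python
# A minimal sketch of SUPERB-style probing of frozen SSL speech representations.
# NOT the authors' actual ML-SUPERB setup: the checkpoint name, the CTC
# phone-recognition head, and the label-set size are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
ssl_encoder.eval()  # the SSL encoder stays frozen; only the probe is trained

NUM_LABELS = 50  # hypothetical target phone inventory size (index 0 = CTC blank)
probe = nn.Linear(ssl_encoder.config.hidden_size, NUM_LABELS)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(waveforms_16k, labels, label_lengths):
    """One training step of the lightweight probe on top of frozen SSL features."""
    inputs = extractor(waveforms_16k, sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():  # no gradients flow into the SSL encoder
        feats = ssl_encoder(**inputs).last_hidden_state      # (batch, frames, hidden)
    log_probs = probe(feats).log_softmax(-1).transpose(0, 1)  # (frames, batch, labels)
    # Simplification: treat every padded clip as full length.
    frame_lengths = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    loss = ctc(log_probs, labels, frame_lengths, label_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```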
Related papers
- Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties [22.274503709032317]
This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties.
International online review platforms, such as Booking.com, can serve as effective data sources.
arXiv Detail & Related papers (2025-02-10T21:49:35Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems [4.150560582918129]
We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese.
We find that even a limited monolingual corpus further improves the model's Taiwanese Hokkien capabilities.
arXiv Detail & Related papers (2024-03-18T17:56:13Z)
- Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model [31.68119156599923]
This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
arXiv Detail & Related papers (2023-11-29T09:48:34Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models [18.22157315310462]
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings.
In this paper, we investigate how effective a cross-lingual model is in comparison with a monolingual model.
We examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data.
arXiv Detail & Related papers (2023-05-09T06:28:10Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones but that, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST), which translates speech from one language into another.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that unsupervised translation quality suffers because these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in a single language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages; a generic sketch of this kind of multi-teacher distillation follows this list.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
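
As flagged in the Language Branch Knowledge Distillation entry above, the following is a generic sketch of multi-teacher distillation: a single student is trained to match the averaged soft targets of several language-branch teachers. The temperature, the averaging scheme, and the toy logits are illustrative assumptions, not details taken from that paper.

```python
# Generic multi-teacher knowledge distillation sketch (illustrative only; the
# LBMRC paper's exact objective may differ). Each language-branch teacher emits
# soft answer scores; the single student is trained to match their average.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL divergence between the student and the averaged teacher distributions."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)  # average the teachers' softened predictions
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # temperature**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: random logits stand in for answer-span scores over 128 tokens.
student = torch.randn(4, 128)
teachers = [torch.randn(4, 128) for _ in range(3)]  # three language-branch teachers
print(multi_teacher_kd_loss(student, teachers))
```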
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.