Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
- URL: http://arxiv.org/abs/2403.12024v2
- Date: Tue, 14 May 2024 05:49:00 GMT
- Title: Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
- Authors: Bo-Han Lu, Yi-Hsuan Lin, En-Shiun Annie Lee, Richard Tzong-Han Tsai
- Abstract summary: We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese.
We find that the use of a limited monolingual corpus still further improves the model's Taiwanese Hokkien capabilities.
- Score: 4.150560582918129
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine translation focuses mainly on high-resource languages (HRLs), while low-resource languages (LRLs) like Taiwanese Hokkien are relatively under-explored. The study aims to address this gap by developing a dual translation model between Taiwanese Hokkien and both Traditional Mandarin Chinese and English. We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese. Our comprehensive experiments involve translation tasks across various writing systems of Taiwanese Hokkien as well as between Taiwanese Hokkien and other HRLs. We find that the use of a limited monolingual corpus still further improves the model's Taiwanese Hokkien capabilities. We then utilize our translation model to standardize all Taiwanese Hokkien writing systems into Hokkien Han, resulting in further performance improvements. Additionally, we introduce an evaluation method incorporating back-translation and GPT-4 to ensure reliable translation quality assessment even for LRLs. The study contributes to narrowing the resource gap for Taiwanese Hokkien and empirically investigates the advantages and limitations of pre-training and fine-tuning based on LLaMA 2.
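As a concrete illustration of the dual-translation setup described in the abstract, the sketch below runs an instruction-style prompt through a fine-tuned causal LM and performs the round trip used for back-translation-based evaluation. The checkpoint path and prompt template are illustrative assumptions, not the authors' released artifacts.

```python
# Minimal sketch of the dual-translation setup, assuming a causal LM
# fine-tuned on translation instructions. The checkpoint path and
# prompt template are illustrative, not the authors' released artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/hokkien-dual-translation-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def translate(text: str, src: str, tgt: str) -> str:
    """Translate `text` from `src` to `tgt` with a greedy decode."""
    prompt = f"Translate the following {src} sentence into {tgt}.\n{src}: {text}\n{tgt}:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Round trip used for back-translation-based evaluation: translate
# Mandarin -> Hokkien Han and back, then have a judge model (GPT-4 in
# the paper) compare the round trip against the original source.
src_zh = "今天天氣真好。"
hok = translate(src_zh, "Traditional Mandarin Chinese", "Taiwanese Hokkien Han")
back = translate(hok, "Taiwanese Hokkien Han", "Traditional Mandarin Chinese")
print(hok, back)
```

The standardization step the paper proposes, converting other Hokkien writing systems into Hokkien Han before translation, would slot in as a pre-processing call to the same `translate` function.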
Related papers
- Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties [22.274503709032317]
This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties.
International online review platforms, such as Booking.com, can serve as effective data sources.
arXiv Detail & Related papers (2025-02-10T21:49:35Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of large language models (LLMs).
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
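A minimal sketch of how a TMLU-style multiple-choice suite can be scored; the item schema and prompt format below are assumptions for illustration, not the benchmark's actual data format.

```python
# Toy scorer for TMLU-style multiple-choice items. The item schema
# (question/choices/answer) is an assumed format for illustration,
# not the benchmark's actual data layout.
def accuracy(items, ask_model):
    """`ask_model` maps a formatted prompt to one of 'A'-'D'."""
    correct = 0
    for item in items:
        choices = "\n".join(f"{label}. {text}"
                            for label, text in zip("ABCD", item["choices"]))
        prompt = f"{item['question']}\n{choices}\nAnswer:"
        if ask_model(prompt) == item["answer"]:
            correct += 1
    return correct / len(items)

demo = [{"question": "台灣最高的山是哪一座?",
         "choices": ["玉山", "雪山", "合歡山", "阿里山"],
         "answer": "A"}]
print(accuracy(demo, lambda p: "A"))  # stub model always answers "A"
```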
- Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus [12.780273009783102]
Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan.
To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set.
arXiv Detail & Related papers (2023-12-06T01:32:20Z)
- Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model [31.68119156599923]
This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
arXiv Detail & Related papers (2023-11-29T09:48:34Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
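To give intuition for the Shuowen step described above, retrieving the most appropriate meaning from a dictionary, here is a toy rule-based stand-in: score each candidate gloss by character overlap with the sentence context. CDBERT itself learns this matching; the overlap heuristic and the mini-dictionary are illustrative assumptions only.

```python
# Toy stand-in for the "Shuowen" retrieval step: pick the dictionary
# gloss sharing the most characters with the sentence context. CDBERT
# learns this matching; the heuristic and mini-dictionary below are
# illustrative assumptions.
TOY_DICT = {
    "行": ["行走，移動", "行列，排序", "可以，能幹"],
}

def retrieve_gloss(char: str, context: str) -> str:
    candidates = TOY_DICT.get(char, ["(no entry)"])
    return max(candidates, key=lambda gloss: len(set(gloss) & set(context)))

print(retrieve_gloss("行", "他沿著馬路慢慢行走"))  # -> 行走，移動
```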
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on multilingual models rather than the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine-translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
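The translate-test recipe referred to above is simple enough to state as code: translate every test input into the classifier's language, then classify. The two stubs below stand in for any MT system and any monolingual classifier.

```python
# Schematic translate-test pipeline. `mt` and `clf` are stand-ins for
# an MT system and an English-only classifier; the paper's point is
# that a stronger `mt`, plus training `clf` on machine-translated
# text to remove the train/test mismatch, lifts this baseline.
def translate_test(examples, mt, clf):
    return [clf(mt(x)) for x in examples]

mt = lambda x: "the weather is good today"       # stub translation
clf = lambda x: "positive" if "good" in x else "negative"
print(translate_test(["今仔日天氣真好"], mt, clf))  # -> ['positive']
```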
- Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien [5.272372029223681]
In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants.
We propose a method for constructing a Hokkien-Mandarin CM dataset that mitigates the scarcity of such corpora, addresses morphological issues within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method.
arXiv Detail & Related papers (2023-01-21T11:04:20Z)
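The entry above mentions an efficient Hokkien word segmentation method. As a generic illustration of lexicon-driven segmentation, not necessarily the paper's algorithm, here is a forward maximum-matching segmenter:

```python
# Forward maximum matching: a classic lexicon-based segmenter for
# scripts written without spaces. A generic illustration; the paper's
# own segmentation method may differ.
def fmm_segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # Try the longest lexicon match first, fall back to one char.
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in lexicon:
                words.append(text[i:i + l])
                i += l
                break
    return words

lex = {"歹勢", "借過", "一下"}           # tiny illustrative Hokkien-Han lexicon
print(fmm_segment("歹勢借過一下", lex))  # -> ['歹勢', '借過', '一下']
```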
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
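The mechanism being augmented can be sketched in a few lines of numpy: ordinary scaled dot-product attention with position embeddings injected into the keys. The paper's cross-lingual variant derives positions from bilingual word alignments; the plain indices here are a simplification.

```python
import numpy as np

# Scaled dot-product attention with position embeddings added to the
# keys: the generic mechanism behind position-aware self-attention.
# The paper's variant derives positions from bilingual alignments;
# plain per-token indices are used here for brevity.
def attention_with_positions(X, Wq, Wk, Wv, pos_emb):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    K = K + pos_emb[: X.shape[0]]                 # inject position info
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(n, d))
print(attention_with_positions(X, Wq, Wk, Wv, pos).shape)  # (5, 8)
```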
- Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact on existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
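The NLI artifact described above is easy to quantify: measure premise-hypothesis lexical overlap, which drops when the two sides are translated independently. A minimal token-level measure follows; the exact metric in the paper's analysis may differ.

```python
# Token-level premise-hypothesis overlap, the quantity that shrinks
# when both sides are machine-translated independently (different
# synonym choices on each side). The exact metric may differ from
# the paper's analysis.
def lexical_overlap(premise: str, hypothesis: str) -> float:
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

print(lexical_overlap("a man is playing a guitar",
                      "a man is playing an instrument"))  # ~0.67
```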
- Evaluating Low-Resource Machine Translation between Chinese and Vietnamese with Back-Translation [32.25731930652532]
Back-translation (BT) has been widely used and has become one of the standard techniques for data augmentation in Neural Machine Translation (NMT).
We evaluate and compare the effects of different sizes of synthetic data on both NMT and Statistical Machine Translation (SMT) models for Chinese-to-Vietnamese and Vietnamese-to-Chinese translation, with character-based and word-based settings.
arXiv Detail & Related papers (2020-03-04T17:10:10Z)
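The BT recipe evaluated above reduces to: use a reverse-direction model to turn target-side monolingual text into synthetic sources, then mix the synthetic pairs into the genuine parallel data. A schematic, with a stub in place of a trained reverse model:

```python
# Back-translation for data augmentation: synthesize (source, target)
# pairs from target-side monolingual text with a reverse-direction MT
# model, then train the forward model on real plus synthetic pairs.
def back_translate_augment(mono_target, reverse_mt, parallel):
    synthetic = [(reverse_mt(t), t) for t in mono_target]
    return parallel + synthetic

# Stubs so the sketch runs; `reverse_mt` stands in for a trained
# Vietnamese->Chinese (or any reverse-direction) model.
parallel = [("你好", "xin chào")]
mono_vi = ["cảm ơn", "tạm biệt"]
reverse_mt = lambda vi: f"<zh of {vi}>"
print(back_translate_augment(mono_vi, reverse_mt, parallel))
```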
- Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to compensate for the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)
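One simple way to instantiate the data-selection idea is to rank assisting-language sentences by overlap with the target language's character inventory; the scoring criterion below is an assumption for illustration, not necessarily the paper's selection method.

```python
# Select assisting-language sentences (e.g. Chinese, when Japanese
# data is scarce) by overlap with the target language's character
# inventory. The overlap score is an illustrative criterion, not
# necessarily the selection method used in the paper.
def select_sentences(assisting_corpus, target_charset, top_k):
    def score(sent):
        chars = set(sent)
        return len(chars & target_charset) / len(chars) if chars else 0.0
    return sorted(assisting_corpus, key=score, reverse=True)[:top_k]

target_chars = set("日本語学校漢字")              # toy Japanese character set
zh = ["学校很大", "这是一只猫", "漢字与日本"]
print(select_sentences(zh, target_chars, top_k=2))
```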