Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned
Language Model
- URL: http://arxiv.org/abs/2311.17487v1
- Date: Wed, 29 Nov 2023 09:48:34 GMT
- Title: Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned
Language Model
- Authors: Yen-Ting Lin, Yun-Nung Chen
- Abstract summary: This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
- Score: 31.68119156599923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of language models, the nuanced linguistic and cultural
intricacies of Traditional Chinese, as spoken in Taiwan, have been largely
overlooked. This paper introduces Taiwan LLM, a pioneering Large Language Model
that specifically caters to the Traditional Chinese language, with a focus on
the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and
instruction-finetuning datasets, we have developed a model that not only
understands the complexities of Traditional Chinese but also embodies the
cultural context of Taiwan. Taiwan LLM represents the first of its kind, a
model that is not only linguistically accurate but also culturally resonant
with its user base. Our evaluations demonstrate that Taiwan LLM achieves
superior performance in understanding and generating Traditional Chinese text,
outperforming existing models that are predominantly trained on Simplified
Chinese or English. The open-source release of Taiwan LLM invites collaboration
and further innovation, ensuring that the linguistic diversity of Chinese
speakers is embraced and well-served. The model, datasets, and further
resources are made publicly available to foster ongoing research and
development in this field.
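The abstract states that the model, datasets, and further resources are released publicly. As a rough illustration of how such an open-source model could be queried, the following minimal Python sketch uses Hugging Face Transformers; the repository identifier "example-org/Taiwan-LLM-chat" is a hypothetical placeholder, not the identifier published with the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholder; substitute the repository name released by the authors.
model_id = "example-org/Taiwan-LLM-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Traditional Chinese prompt: "Please briefly introduce Taiwan's night market culture."
prompt = "請簡單介紹台灣的夜市文化。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))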
Related papers
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z)
- How Chinese are Chinese Language Models? The Puzzling Lack of Language Policy in China's LLMs [2.9123921488295768]
We evaluate six open-source multilingual LLMs pre-trained by Chinese companies on 18 languages.
Our experiments show that Chinese LLMs' performance on diverse languages is indistinguishable from that of international LLMs.
We find no sign of any consistent policy, either for or against, language diversity in China's LLM development.
arXiv Detail & Related papers (2024-07-12T19:21:40Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. A minimal illustrative sketch of this freezing pattern follows this entry.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
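The MoE-CT summary above describes freezing the base model while training only an appended mixture-of-experts module. The sketch below is a loose PyTorch illustration of that freezing pattern under stated assumptions; the SimpleMoE module and freeze_base_and_attach helper are invented for illustration and do not reproduce the paper's actual architecture.

import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    # Illustrative appended module: a router softly mixes a few small expert MLPs
    # and adds the result back to the frozen base model's hidden states.
    def __init__(self, hidden_size: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.GELU(),
                nn.Linear(hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size)
        weights = torch.softmax(self.router(hidden), dim=-1)                 # (batch, seq, experts)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (batch, seq, hidden, experts)
        return hidden + (expert_out * weights.unsqueeze(-2)).sum(dim=-1)     # residual mixture

def freeze_base_and_attach(base_model: nn.Module, hidden_size: int) -> SimpleMoE:
    # Freeze every original parameter so high-resource performance is preserved,
    # then return a trainable adapter intended to be trained on multilingual data.
    for p in base_model.parameters():
        p.requires_grad = False
    return SimpleMoE(hidden_size)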
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper addresses the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of large language models (LLMs).
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
- Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems [4.150560582918129]
We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese.
We find that even a limited monolingual corpus further improves the model's Taiwanese Hokkien capabilities.
arXiv Detail & Related papers (2024-03-18T17:56:13Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus [12.780273009783102]
Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan.
To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set.
arXiv Detail & Related papers (2023-12-06T01:32:20Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)