A Topic-aware Comparable Corpus of Chinese Variations
- URL: http://arxiv.org/abs/2411.10955v1
- Date: Sun, 17 Nov 2024 04:06:12 GMT
- Title: A Topic-aware Comparable Corpus of Chinese Variations
- Authors: Da-Chen Lian, Shu-Kai Hsieh
- Abstract summary: Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese, we create a comparable corpus that updates regularly and reflects modern language use on social media.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aims to fill the gap by constructing a topic-aware comparable corpus of Mainland Chinese Mandarin and Taiwanese Mandarin from social media in Mainland China and Taiwan, respectively. Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese, we create a comparable corpus that updates regularly and reflects modern language use on social media.
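To make the idea of a "topic-aware comparable corpus" concrete, the sketch below pairs posts from the two platforms by topical similarity. This is an illustrative toy, not the authors' pipeline: the posts are invented, and TF-IDF over character n-grams with cosine similarity is one simple way to align documents across the two Chinese varieties without word segmentation.

```python
# Minimal sketch (not the authors' method): pairing posts from two Chinese
# varieties by topical similarity to form a comparable corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical posts: Dcard (Taiwanese Mandarin) and Sina Weibo (Mainland).
dcard_posts = ["台北捷運票價調整", "珍珠奶茶哪家最好喝"]
weibo_posts = ["北京地铁票价上涨", "奶茶店排行榜"]

# Character 1-2 grams sidestep cross-strait word-segmentation differences,
# though traditional vs. simplified characters still limit overlap.
vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
tfidf = vec.fit_transform(dcard_posts + weibo_posts)

# Similarity between each Dcard post and each Weibo post.
sim = cosine_similarity(tfidf[: len(dcard_posts)], tfidf[len(dcard_posts):])

# Greedily pair each Dcard post with its most similar Weibo post.
pairs = [(i, int(sim[i].argmax())) for i in range(len(dcard_posts))]
```

A production system would, as the abstract notes, refresh the crawl regularly and could replace the similarity step with topic modeling; script conversion (traditional/simplified) before vectorizing would also improve alignment.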
Related papers
- Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties [22.274503709032317]
This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties.
International online review platforms, such as Booking.com, can serve as effective data sources.
arXiv Detail & Related papers (2025-02-10T21:49:35Z)
- Building a Taiwanese Mandarin Spoken Language Model: A First Attempt [44.54200115439157]
This report describes a first attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, tailored to enable real-time speech interaction in multi-turn conversations.
Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving conversational flow.
arXiv Detail & Related papers (2024-11-11T16:37:40Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems [4.150560582918129]
We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese.
We find that even a limited monolingual corpus further improves the model's Taiwanese Hokkien capabilities.
arXiv Detail & Related papers (2024-03-18T17:56:13Z)
- Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model [31.68119156599923]
This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
arXiv Detail & Related papers (2023-11-29T09:48:34Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z)
- A New Dataset and Empirical Study for Sentence Simplification in Chinese [50.0624778757462]
This paper introduces CSS, a new dataset for assessing sentence simplification in Chinese.
We collect manual simplifications from human annotators and perform data analysis to show the difference between English and Chinese sentence simplifications.
In the end, we explore whether Large Language Models can serve as high-quality Chinese sentence simplification systems by evaluating them on CSS.
arXiv Detail & Related papers (2023-06-07T06:47:34Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Cross-strait Variations on Two Near-synonymous Loanwords xie2shang1 and tan2pan4: A Corpus-based Comparative Study [2.6194322370744305]
This study investigates cross-strait variation in two typical near-synonymous loanwords in Chinese, i.e. xie2shang1 and tan2pan4.
Through a comparative analysis, the study found distributional, eventive, and contextual similarities and differences between Taiwan and Mainland Mandarin.
arXiv Detail & Related papers (2022-10-09T04:10:58Z)
- An Analysis of the Differences Among Regional Varieties of Chinese in Malay Archipelago [5.030581940990434]
Chinese features prominently in the Chinese communities of the nations of the Malay Archipelago.
There, Chinese has adjusted to the local languages and cultures, giving rise to a distinct Chinese variety in each country.
arXiv Detail & Related papers (2022-09-10T07:29:25Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- A Corpus of Adpositional Supersenses for Mandarin Chinese [15.757892250956715]
This paper presents a corpus in which all adpositions have been semantically annotated in Mandarin Chinese.
Our approach adapts a framework that defined a general set of supersenses according to ostensibly language-independent semantic criteria.
We find that the supersense categories are well-suited to Chinese adpositions despite syntactic differences from English.
arXiv Detail & Related papers (2020-03-18T18:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.