Sun-Shine: A Large Language Model for Tibetan Culture
- URL: http://arxiv.org/abs/2503.18288v2
- Date: Fri, 28 Mar 2025 03:35:17 GMT
- Title: Sun-Shine: A Large Language Model for Tibetan Culture
- Authors: Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu,
- Abstract summary: We introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts.
- Score: 8.303987580599266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite the success in other fields, current LLMs often fall short in catering to the needs of domain experts like Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which excels at various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, like language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.
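As a hedged illustration of how such a model could be exercised once a checkpoint is released, the sketch below loads a Llama-family model with Hugging Face Transformers and generates a continuation; the model ID and prompt are placeholders, since no official checkpoint name is given here.

```python
# Hypothetical usage sketch: loading a Llama-family checkpoint fine-tuned for
# Tibetan (the model ID below is a placeholder, not an official release).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-sunshine-tibetan"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Ask the model to continue a prompt about Tibetan culture (English here for clarity).
prompt = "Describe the significance of the Tibetan New Year (Losar):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```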
Related papers
- TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models [10.77750944881769]
We introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset constructed via chain-of-thought prompting with large language models (LLMs). Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation.
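A rough sketch of chain-of-thought-style data collection of the kind described (not the authors' actual pipeline): an instruction-following LLM is asked to reason step by step, and its outputs are stored as instruction data. The client, model name, questions, and prompt wording are all illustrative assumptions.

```python
# Illustrative sketch only: collecting chain-of-thought style training examples
# with an instruction-following LLM. Model choice and prompts are placeholders,
# not the TIBSTC-CoT construction pipeline itself.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

questions = [
    "What are the main festivals of the Tibetan calendar year?",
    "Explain the difference between present and past verb stems in Tibetan.",
]

records = []
for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "Answer in Tibetan. Think step by step and show your reasoning before the final answer."},
            {"role": "user", "content": q},
        ],
    )
    records.append({"question": q, "cot_answer": response.choices[0].message.content})

with open("tibetan_cot_samples.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
```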
arXiv Detail & Related papers (2025-08-04T01:32:58Z) - Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training [43.57169338795754]
We create the largest Tibetan pre-training corpus to date, aggregating data from diverse sources. With curated data, we continue pre/post-training a multilingual base model to enhance its generative capabilities in Tibetan. We create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks.
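A minimal sketch of the continual pre-training step described above, assuming a causal multilingual base model and a plain-text Tibetan corpus; the checkpoint name, file path, and hyperparameters are placeholders, not the paper's actual setup.

```python
# Sketch of continual pre-training on a Tibetan text corpus (paths, checkpoint,
# and hyperparameters are illustrative only).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "your-org/multilingual-base-7b"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:               # many causal LMs ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

corpus = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tibetan-continual-pretrain",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```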
arXiv Detail & Related papers (2025-07-12T08:54:05Z) - All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z) - LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z) - Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology [19.282867207168565]
The Balti language belongs to the Sino-Tibetan family, specifically to its Tibeto-Burman branch.
It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan.
Considering the diverse cultural, socio-political, religious, and geographical influences, it is important to take steps toward unifying the dialects.
arXiv Detail & Related papers (2024-11-20T15:48:21Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
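For intuition, here is a small numerical sketch of the relative-representation idea the framework builds on: both embedding spaces are re-expressed as cosine similarities to a shared set of anchor (translation-pair) words, so they become directly comparable. The data below is synthetic and the dimensions are arbitrary.

```python
# Sketch of the relative-representation idea behind MoSECroT: express both the
# source-language PLM vectors and the target-language static vectors as cosine
# similarities to a shared set of anchors (toy, randomly generated data).
import numpy as np

def relative_representation(vectors: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Return each vector's cosine similarity to every anchor vector."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return v @ a.T  # shape: (num_words, num_anchors)

rng = np.random.default_rng(0)
# 768-d PLM embeddings for the source language, 300-d static embeddings for the
# target language, and 50 translation-pair anchor words in each space.
src_plm_vecs, src_anchor_vecs = rng.normal(size=(1000, 768)), rng.normal(size=(50, 768))
tgt_static_vecs, tgt_anchor_vecs = rng.normal(size=(800, 300)), rng.normal(size=(50, 300))

src_rel = relative_representation(src_plm_vecs, src_anchor_vecs)
tgt_rel = relative_representation(tgt_static_vecs, tgt_anchor_vecs)
# Both now live in the same 50-dimensional anchor-similarity space, so
# target-language words can be related to the source PLM's embedding space.
assert src_rel.shape[1] == tgt_rel.shape[1]
```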
arXiv Detail & Related papers (2024-01-09T21:09:07Z) - Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model [31.68119156599923]
This paper introduces Taiwan LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language.
We have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan.
arXiv Detail & Related papers (2023-11-29T09:48:34Z) - MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China [33.08119305158835]
We present MC$^2$, a Multilingual Corpus of Minority Languages in China.
MC$2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian.
arXiv Detail & Related papers (2023-11-14T17:45:50Z) - PEFTT: Parameter-Efficient Fine-Tuning for low-resource Tibetan pre-trained language models [0.0]
There is currently no large language model for Tibetan due to its low-resource nature.
We conducted three types of efficient fine-tuning experiments on the publicly available TNCC-title dataset.
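A brief sketch of one common parameter-efficient recipe (LoRA via the peft library) applied to a Tibetan sequence-classification model; the base checkpoint, label count, and LoRA hyperparameters are placeholders, and the paper's own experiments may use different PEFT variants.

```python
# Illustrative LoRA fine-tuning setup for Tibetan text classification
# (checkpoint name, num_labels, and LoRA settings are placeholders).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model = "your-org/tibetan-bert-base"  # placeholder Tibetan pre-trained model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=12)

# Wrap the model so only low-rank adapter weights (and the classifier head) train.
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of all parameters
```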
arXiv Detail & Related papers (2023-09-21T14:29:23Z) - ChatGPT MT: Competitive for High- (but not Low-) Resource Languages [62.178282377729566]
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT).
We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis.
Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it.
arXiv Detail & Related papers (2023-09-14T04:36:00Z) - Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
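For context, the translate-then-generate baseline that NMT bootstrapping implies can be sketched as follows: an NMT model first maps a non-English prompt into English, which is then fed to an English-centric text-to-image model. The checkpoints are commonly used public ones chosen for illustration, not the models from the paper, and the EnsAd adapter itself is not shown.

```python
# Translate-then-generate sketch (illustrative checkpoints; not the EnsAd method).
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline

# Step 1: translate the prompt into English with an off-the-shelf NMT model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
prompt_en = translator("高原上的一座古老寺庙，日落时分")[0]["translation_text"]
# (source prompt roughly: "an old monastery on the plateau at sunset")

# Step 2: feed the translated prompt to an English-centric text-to-image model.
tti = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image = tti(prompt_en).images[0]
image.save("monastery_sunset.png")
```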
arXiv Detail & Related papers (2023-05-30T17:03:52Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - TiBERT: Tibetan Pre-trained Language Model [2.9554549423413303]
This paper collects large-scale training data from Tibetan websites and constructs a vocabulary that can cover 99.95% of the words in the corpus by using SentencePiece.
We apply TiBERT to the downstream tasks of text classification and question generation, and compare it with classic models and multilingual pre-trained models.
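The vocabulary-construction step can be sketched with the SentencePiece library roughly as follows; the corpus path, vocabulary size, and other settings are illustrative, not the exact configuration reported for TiBERT.

```python
# Sketch of training a SentencePiece vocabulary on raw Tibetan text
# (file path, vocab size, and model type are illustrative placeholders).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tibetan_web_corpus.txt",   # one sentence per line
    model_prefix="tibetan_sp",
    vocab_size=32000,
    character_coverage=0.9995,        # cover nearly all characters in the corpus
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="tibetan_sp.model")
print(sp.encode("A Tibetan sentence would go here.", out_type=str))
```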
arXiv Detail & Related papers (2022-05-15T14:45:08Z) - Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
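One such adaptation step can be sketched as extending the tokenizer with subword units for the new script and resizing the embedding matrix before further masked-language-model training; the tokens below are toy placeholders rather than real mined subwords, and this is only one of the techniques the paper covers.

```python
# Sketch: extend a multilingual model's vocabulary for an unseen script
# (the new tokens here are toy placeholders).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# New subword units for a low-resource script, e.g. mined from target-language text.
new_tokens = ["▁new_subword_1", "▁new_subword_2", "▁new_subword_3"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; embedding matrix now has {len(tokenizer)} rows.")
# The extended model would then be further pre-trained (e.g. with MLM) on
# target-language data before fine-tuning on downstream tasks.
```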
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks.
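The shared-encoder, multi-head design can be sketched in a few lines of PyTorch; the encoder checkpoint and tag-set sizes are illustrative, not N-LTP's actual configuration.

```python
# Minimal sketch of a multi-task setup: one shared pretrained encoder with
# lightweight task-specific heads (tag-set sizes are toy values).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared across tasks
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            "seg": nn.Linear(hidden, 4),    # word segmentation tags (toy size)
            "pos": nn.Linear(hidden, 30),   # part-of-speech tags (toy size)
            "ner": nn.Linear(hidden, 10),   # named-entity tags (toy size)
        })

    def forward(self, task: str, **encoder_inputs):
        hidden_states = self.encoder(**encoder_inputs).last_hidden_state
        return self.heads[task](hidden_states)  # per-token logits for the chosen task

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = SharedEncoderMultiTask()
batch = tokenizer(["A sample Chinese sentence would be tokenized here."], return_tensors="pt")
pos_logits = model("pos", **batch)
print(pos_logits.shape)  # (batch, sequence_length, num_pos_tags)
```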
arXiv Detail & Related papers (2020-09-24T11:45:39Z) - ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It demonstrates state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.