Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology
- URL: http://arxiv.org/abs/2411.13409v1
- Date: Wed, 20 Nov 2024 15:48:21 GMT
- Title: Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology
- Authors: Muhammad Sharif, Jiangyan Yi, Muhammad Shoaib
- Abstract summary: The Balti language belongs to the Sino-Tibetan, specifically the Tibeto-Burman, language family.
It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan.
Considering the diverse cultural, socio-political, religious, and geographical impacts, it is important to step forward in unifying the dialects.
- Score: 19.282867207168565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The language called Balti belongs to the Sino-Tibetan, specifically the Tibeto-Burman, language family. It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan, where local cultures have influenced it and produced various dialects. Considering the diverse cultural, socio-political, religious, and geographical impacts, it is important to step forward in unifying the dialects on the basis of common roots, lexica, and phonological perspectives. In the era of globalization and increasingly frequent developments in AI technology, understanding this diversity and the efforts at dialect unification is important for understanding commonalities and narrowing the gaps caused by unavoidable circumstances. This article examines how artificial intelligence (AI), in the essence of Large Language Models (LLMs), can assist in analyzing, documenting, and standardizing the endangered Balti language, based on the efforts made in different dialects so far.
Related papers
- Sun-Shine: A Large Language Model for Tibetan Culture [8.303987580599266]
We introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture.
Sun-Shine incorporates state-of-the-art model optimized architectures for Tibetan's linguistic features.
We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts.
arXiv Detail & Related papers (2025-03-24T02:17:41Z)
- BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization [7.059964549363294]
The study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech.
Bangla is the fifth most spoken language, with around 55 distinct dialects spoken by 160 million people, so addressing its dialects is crucial for developing inclusive communication tools.
Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
arXiv Detail & Related papers (2024-11-16T20:20:15Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Content-Localization based System for Analyzing Sentiment and Hate Behaviors in Low-Resource Dialectal Arabic: English to Levantine and Gulf [5.2957928879391]
This paper proposes to localize content of resources in high-resourced languages into under-resourced Arabic dialects.
We utilize content-localization based neural machine translation to develop sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine and Gulf.
Our findings shed light on the importance of considering the unique nature of dialects within the same language; ignoring the dialectal aspect would lead to misleading analysis.
arXiv Detail & Related papers (2023-11-27T15:37:33Z)
- Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Capturing the diversity of multilingual societies [0.0]
We consider the processes at work in language shift through a conjunction of theoretical and data-driven perspectives.
A large-scale empirical study of spatial patterns of languages in multilingual societies using Twitter and census data yields a wide diversity.
We propose a model in which coexistence of languages may be reached when learning the other language is facilitated and when bilinguals favor the use of the endangered language.
arXiv Detail & Related papers (2021-05-06T10:27:43Z)
- The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.