PersianLLaMA: Towards Building First Persian Large Language Model
- URL: http://arxiv.org/abs/2312.15713v1
- Date: Mon, 25 Dec 2023 12:48:55 GMT
- Title: PersianLLaMA: Towards Building First Persian Large Language Model
- Authors: Mohammad Amin Abbasi, Arash Ghafouri, Mahdi Firouzmandi, Hassan Naderi and Behrouz Minaei Bidgoli
- Abstract summary: This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the widespread use of the Persian language by millions globally,
limited efforts have been made in natural language processing for this
language. The use of large language models as effective tools in various
natural language processing tasks typically requires extensive textual data and
robust hardware resources. Consequently, the scarcity of Persian textual data
and the unavailability of powerful hardware resources have hindered the
development of large language models for Persian. This paper introduces the
first large Persian language model, named PersianLLaMA, trained on a collection
of Persian texts and datasets. This foundational model comes in two versions,
with 7 and 13 billion parameters, trained on formal and colloquial Persian
texts using two different approaches. PersianLLaMA has been evaluated for
natural language generation tasks based on the latest evaluation methods,
namely using larger language models, and for natural language understanding
tasks based on automated machine metrics. The results indicate that
PersianLLaMA significantly outperforms its competitors in both understanding
and generating Persian text. PersianLLaMA marks an important step in the
development of Persian natural language processing and can be a valuable
resource for the Persian-speaking community. This large language model can be
used for various natural language processing tasks, especially text generation applications such as chatbots, question answering, machine translation, and text summarization.
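Since the abstract positions PersianLLaMA as a foundational model for downstream generation tasks, a minimal usage sketch may help. It assumes the checkpoints are published on the Hugging Face Hub; the repository ID below is an assumption, not something confirmed by the paper.

```python
# Minimal sketch: generating Persian text with PersianLLaMA via Hugging Face
# transformers. The model ID is a guess -- substitute whatever repository
# actually hosts the released 7B/13B checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ViraIntelligentDataMining/PersianLLaMA-13B"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto",
)

prompt = "پایتخت ایران کجاست؟"  # "What is the capital of Iran?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```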
Related papers
- PersianMind: A Cross-Lingual Persian-English Large Language Model [2.565964707090901]
We introduce PersianMind, an open-source bilingual large language model.
It demonstrates comparable performance to closed-source GPT-3.5-turbo in the Persian language.
Our approach preserves the model's English knowledge and employs transfer learning to excel at transferring task knowledge from one language to another.
arXiv Detail & Related papers (2024-01-12T09:24:10Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), where suitable datasets are scarce.
To overcome this limitation, we create a dedicated dataset from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
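The transliteration step in the romanization entry above can be illustrated in a few lines. The sketch below uses the unidecode package as a stand-in romanizer; the study itself uses the UROMAN tool, so treat this as an approximation of the idea rather than the paper's exact pipeline.

```python
# Illustrative romanization-based preprocessing for an mPLM.
# `unidecode` stands in for UROMAN here; the actual study uses UROMAN.
from unidecode import unidecode
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

persian = "زبان فارسی"          # "the Persian language"
romanized = unidecode(persian)   # e.g. "zban farsy" -- lossy but Latin-script

# Romanization maps unseen scripts onto the Latin subwords the multilingual
# model already knows, which is the core idea behind this adaptation strategy.
print(tokenizer.tokenize(persian))
print(tokenizer.tokenize(romanized))
```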
- Multilingual Text Classification for Dravidian Languages [4.264592074410622]
We propose a multilingual text classification framework for the Dravidian languages.
On the one hand, the framework uses the LaBSE pre-trained model as its base.
On the other hand, because the base model cannot fully recognize and exploit the correlations among languages, we further propose a language-specific representation module.
arXiv Detail & Related papers (2021-12-03T04:26:49Z)
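As a rough illustration of the LaBSE-based setup above (not the paper's exact architecture, and omitting its language-specific module), one can encode sentences with LaBSE and fit a lightweight classifier on top:

```python
# Sketch: LaBSE sentence embeddings feeding a simple classifier.
# Texts, labels, and the downstream task are placeholders, not the paper's data.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")

texts = ["...", "..."]   # replace with Dravidian-language sentences
labels = [0, 1]          # task labels, e.g. offensive vs. non-offensive

embeddings = encoder.encode(texts)           # 768-dim multilingual vectors
clf = LogisticRegression().fit(embeddings, labels)
print(clf.predict(encoder.encode(["..."])))  # predict on new text
```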
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The experimental results show promising, state-of-the-art performance compared to previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
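A minimal sketch of the multilingual distillation idea above: the language-branch teachers' softened outputs are averaged into one target for a single student. Shapes, the number of branches, and the temperature are illustrative assumptions, not the paper's settings.

```python
# Illustrative knowledge distillation from several language-branch teachers
# into one multilingual student; all hyperparameters here are assumptions.
import torch
import torch.nn.functional as F

temperature = 2.0
batch, num_classes = 8, 2  # e.g. per-position start/end span scores in MRC

student_logits = torch.randn(batch, num_classes, requires_grad=True)
teacher_logits = [torch.randn(batch, num_classes) for _ in range(3)]  # 3 branches

# Average the teachers' softened distributions to form the distillation target.
teacher_probs = torch.stack(
    [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
).mean(dim=0)

loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    teacher_probs,
    reduction="batchmean",
) * temperature**2  # standard temperature scaling for distillation
loss.backward()
```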
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
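For reference, ParsBERT checkpoints are distributed on the Hugging Face Hub. A minimal fill-mask sketch follows; the repository ID is the commonly cited one and should be verified before use.

```python
# Minimal sketch: masked-token prediction with ParsBERT via transformers.
# The model ID is the commonly cited Hub repository; verify before relying on it.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# "Tehran is the [MASK] of Iran." -- the model should rank "capital" highly.
for pred in fill_mask("تهران [MASK] ایران است."):
    print(pred["token_str"], round(pred["score"], 3))
```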