PersianLLaMA: Towards Building First Persian Large Language Model
- URL: http://arxiv.org/abs/2312.15713v1
- Date: Mon, 25 Dec 2023 12:48:55 GMT
- Title: PersianLLaMA: Towards Building First Persian Large Language Model
- Authors: Mohammad Amin Abbasi, Arash Ghafouri, Mahdi Firouzmandi, Hassan Naderi and Behrouz Minaei Bidgoli
- Abstract summary: This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the widespread use of the Persian language by millions globally,
limited efforts have been made in natural language processing for this
language. The use of large language models as effective tools in various
natural language processing tasks typically requires extensive textual data and
robust hardware resources. Consequently, the scarcity of Persian textual data
and the unavailability of powerful hardware resources have hindered the
development of large language models for Persian. This paper introduces the
first large Persian language model, named PersianLLaMA, trained on a collection
of Persian texts and datasets. This foundational model comes in two versions,
with 7 and 13 billion parameters, trained on formal and colloquial Persian
texts using two different approaches. PersianLLaMA has been evaluated for
natural language generation tasks based on the latest evaluation methods,
namely using larger language models, and for natural language understanding
tasks based on automated machine metrics. The results indicate that
PersianLLaMA significantly outperforms its competitors in both understanding
and generating Persian text. PersianLLaMA marks an important step in the
development of Persian natural language processing and can be a valuable
resource for the Persian-speaking community. This large language model can be
used for various natural language processing tasks, especially text generation applications such as chatbots, question answering, machine translation, and text summarization.
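Since the abstract positions PersianLLaMA as a foundational model for downstream generation tasks, a minimal usage sketch may help. It assumes the checkpoints are published on the Hugging Face Hub; the repository ID below is an assumption, not something confirmed by the paper.

```python
# Minimal sketch: generating Persian text with PersianLLaMA via Hugging Face
# transformers. The model ID is a guess -- substitute whatever repository
# actually hosts the released 7B/13B checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ViraIntelligentDataMining/PersianLLaMA-13B"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto",
)

prompt = "پایتخت ایران کجاست؟"  # "What is the capital of Iran?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```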
Related papers
- PersianMind: A Cross-Lingual Persian-English Large Language Model [2.565964707090901]
We introduce PersianMind, an open-source bilingual large language model.
It demonstrates comparable performance to closed-source GPT-3.5-turbo in the Persian language.
Our approach preserves the model's English knowledge and employs transfer learning to excel at transferring task knowledge from one language to another.
arXiv Detail & Related papers (2024-01-12T09:24:10Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), where suitable datasets are scarce.
To overcome this limitation, we create a dedicated dataset from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
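The transliteration step in the romanization entry above can be illustrated in a few lines. The sketch below uses the unidecode package as a stand-in romanizer; the study itself uses the UROMAN tool, so treat this as an approximation of the idea rather than the paper's exact pipeline.

```python
# Illustrative romanization-based preprocessing for an mPLM.
# `unidecode` stands in for UROMAN here; the actual study uses UROMAN.
from unidecode import unidecode
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

persian = "زبان فارسی"          # "the Persian language"
romanized = unidecode(persian)   # e.g. "zban farsy" -- lossy but Latin-script

# Romanization maps unseen scripts onto the Latin subwords the multilingual
# model already knows, which is the core idea behind this adaptation strategy.
print(tokenizer.tokenize(persian))
print(tokenizer.tokenize(romanized))
```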
- Multilingual Text Classification for Dravidian Languages [4.264592074410622]
We propose a multilingual text classification framework for the Dravidian languages.
On the one hand, the framework uses the LaBSE pre-trained model as its base.
On the other hand, because the base model cannot fully recognize and exploit the correlations among languages, we further propose a language-specific representation module.
arXiv Detail & Related papers (2021-12-03T04:26:49Z)
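As a rough illustration of the LaBSE-based setup above (not the paper's exact architecture, and omitting its language-specific module), one can encode sentences with LaBSE and fit a lightweight classifier on top:

```python
# Sketch: LaBSE sentence embeddings feeding a simple classifier.
# Texts, labels, and the downstream task are placeholders, not the paper's data.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/LaBSE")

texts = ["...", "..."]   # replace with Dravidian-language sentences
labels = [0, 1]          # task labels, e.g. offensive vs. non-offensive

embeddings = encoder.encode(texts)           # 768-dim multilingual vectors
clf = LogisticRegression().fit(embeddings, labels)
print(clf.predict(encoder.encode(["..."])))  # predict on new text
```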
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The experimental results show promising, state-of-the-art performance compared to previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
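A minimal sketch of the multilingual distillation idea above: the language-branch teachers' softened outputs are averaged into one target for a single student. Shapes, the number of branches, and the temperature are illustrative assumptions, not the paper's settings.

```python
# Illustrative knowledge distillation from several language-branch teachers
# into one multilingual student; all hyperparameters here are assumptions.
import torch
import torch.nn.functional as F

temperature = 2.0
batch, num_classes = 8, 2  # e.g. per-position start/end span scores in MRC

student_logits = torch.randn(batch, num_classes, requires_grad=True)
teacher_logits = [torch.randn(batch, num_classes) for _ in range(3)]  # 3 branches

# Average the teachers' softened distributions to form the distillation target.
teacher_probs = torch.stack(
    [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
).mean(dim=0)

loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    teacher_probs,
    reduction="batchmean",
) * temperature**2  # standard temperature scaling for distillation
loss.backward()
```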
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
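For reference, ParsBERT checkpoints are distributed on the Hugging Face Hub. A minimal fill-mask sketch follows; the repository ID is the commonly cited one and should be verified before use.

```python
# Minimal sketch: masked-token prediction with ParsBERT via transformers.
# The model ID is the commonly cited Hub repository; verify before relying on it.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# "Tehran is the [MASK] of Iran." -- the model should rank "capital" highly.
for pred in fill_mask("تهران [MASK] ایران است."):
    print(pred["token_str"], round(pred["score"], 3))
```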