PersianMind: A Cross-Lingual Persian-English Large Language Model
- URL: http://arxiv.org/abs/2401.06466v1
- Date: Fri, 12 Jan 2024 09:24:10 GMT
- Title: PersianMind: A Cross-Lingual Persian-English Large Language Model
- Authors: Pedram Rostami, Ali Salemi, Mohammad Javad Dousti
- Abstract summary: We introduce PersianMind, an open-source bilingual large language model.
It demonstrates comparable performance to closed-source GPT-3.5-turbo in the Persian language.
Our approach preserves the model's English knowledge and employs transfer learning to excel at transferring task knowledge from one language to another.
- Score: 2.565964707090901
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models demonstrate remarkable proficiency in various
linguistic tasks and have extensive knowledge across various domains. Although
they perform best in English, their ability in other languages is notable too.
In contrast, open-source models, such as LLaMa, are primarily trained on
English datasets, resulting in poor performance in non-English languages. In
this paper, we introduce PersianMind, an open-source bilingual large language
model which demonstrates comparable performance to closed-source GPT-3.5-turbo
in the Persian language. By expanding LLaMa2's vocabulary with 10,000 Persian
tokens and training it on a dataset comprising nearly 2 billion Persian tokens,
we show that our approach preserves the model's English knowledge and employs
transfer learning to excel at transferring task knowledge from one language to
another.
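As a rough illustration of the vocabulary-expansion step described in the abstract, the sketch below shows how new Persian tokens could be added to a LLaMA-2 tokenizer and the embedding matrix resized before continued pretraining. It uses the Hugging Face transformers API as an assumption about tooling; the checkpoint name and the tiny token list are placeholders rather than the authors' released artifacts, and the continued-pretraining loop on the ~2 billion Persian tokens is omitted.

```python
# Minimal sketch (not the authors' code) of expanding a LLaMA-2 vocabulary
# with Persian tokens before continued pretraining on Persian text.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder: in practice the ~10,000 new tokens would come from a
# subword tokenizer (e.g. SentencePiece) trained on a large Persian corpus.
persian_tokens = ["سلام", "کتاب", "دانشگاه"]  # illustrative subset only

num_added = tokenizer.add_tokens(persian_tokens)
print(f"Added {num_added} new tokens")

# Grow the input/output embedding matrices so the new token ids have rows;
# the new rows are randomly initialized and learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))

# A standard causal-LM training loop over the Persian corpus would follow here.
```

In this setup only the embedding matrix gains new rows; all other weights start from the pretrained LLaMA-2 checkpoint, which is what lets the model retain its English knowledge while acquiring Persian.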
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Poro 34B and the Blessing of Multilinguality [3.270981284471548]
Poro 34B is a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages.
We show that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish.
arXiv Detail & Related papers (2024-04-02T11:34:12Z)
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model [33.87586041774359]
Aya is a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced.
We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages.
We conduct detailed investigations of the optimal finetuning mixture composition and data pruning, as well as the toxicity, bias, and safety of our models.
arXiv Detail & Related papers (2024-02-12T17:34:13Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- PersianLLaMA: Towards Building First Persian Large Language Model [5.79461948374354]
This paper introduces the first large Persian language model, named PersianLLaMA, trained on a collection of Persian texts and datasets.
The results indicate that PersianLLaMA significantly outperforms its competitors in both understanding and generating Persian text.
arXiv Detail & Related papers (2023-12-25T12:48:55Z)
- Language Representation Projection: Can We Transfer Factual Knowledge across Languages in Multilingual Language Models? [48.88328580373103]
We propose two parameter-free Language Representation Projection modules (LRP2).
The first module converts non-English representations into English-like equivalents, while the second module reverts English-like representations back into representations of the corresponding non-English language.
Experimental results on the mLAMA dataset demonstrate that LRP2 significantly improves factual knowledge retrieval accuracy and facilitates knowledge transferability across diverse non-English languages.
arXiv Detail & Related papers (2023-11-07T08:16:16Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Bilingual Language Modeling, A transfer learning technique for Roman Urdu [0.0]
We show how the code-switching property of languages may be used to perform cross-lingual transfer learning from a corresponding high-resource language.
We also show how this transfer learning technique, termed Bilingual Language Modeling, can be used to produce better-performing models for Roman Urdu.
arXiv Detail & Related papers (2021-02-22T12:56:37Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.