Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
- URL: http://arxiv.org/abs/2510.12807v1
- Date: Sun, 05 Oct 2025 10:10:04 GMT
- Title: Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning
- Authors: Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat,
- Abstract summary: This paper presents a benchmark of several open-source Large Language Models (LLMs) for Persian Natural Language Processing tasks.<n>We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering.<n>The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.
Related papers
- Multilingual Definition Modeling [1.9409995498330783]
We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German)<n>We test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data.<n>Results show that multilingual language models can perform on-pair with English but cannot leverage potential cross-lingual synergies.
arXiv Detail & Related papers (2025-06-02T09:48:37Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks [5.439505575097552]
We evaluate the ability of large language models in performing domain-specific tasks using the HotpotQA dataset.<n>This task serves as a challenging benchmark for assessing the language comprehension capabilities of these models.<n>The results of the study show that the integration of large language models with these techniques can lead to up to a 4% improvement in F1 score for finding answers.
arXiv Detail & Related papers (2025-01-10T18:44:06Z) - Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation [36.92567530333872]
We study adding a new language, i.e. Persian, to a large language model (LLMs)<n>We employ a multi-stage approach involving pretraining on monolingual Persian data.<n>We evaluate the model's performance at each stage on generation and classification tasks.
arXiv Detail & Related papers (2024-12-17T23:18:06Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.