Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
- URL: http://arxiv.org/abs/2501.04858v1
- Date: Wed, 08 Jan 2025 22:16:40 GMT
- Title: Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
- Authors: Sara Bourbour Hosseinbeigi, Sina Asghari, Mohammad Ali Seif Kashani, Mohammad Hossein Shalchian, Mohammad Amin Abbasi,
- Abstract summary: The research aims to improve retrieval and generation accuracy by introducing Persian-specific models. Three datasets, covering general knowledge (PQuad), scientifically specialized texts, and organizational reports, were used to assess these models. MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper examines the specific obstacles of constructing Retrieval-Augmented Generation (RAG) systems in low-resource languages, with a focus on Persian's complicated morphology and versatile syntax. The research aims to improve retrieval and generation accuracy by introducing Persian-specific models, namely MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. Three datasets, covering general knowledge (PQuad), scientifically specialized texts, and organizational reports, were used to assess these models after they were trained on a varied corpus of 73.11 billion Persian tokens. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluations using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets. Temperature tweaking, chunk size modifications, and document summary indexing were explored to enhance RAG setups. Larger models like Llama-3.1 (70B) consistently demonstrated the highest generation accuracy, while smaller models faced challenges with domain-specific and formal contexts. The findings underscore the potential for developing RAG systems in Persian through customized embeddings and retrieval-generation settings and highlight the enhancement of NLP applications such as search engines and legal document analysis in low-resource languages.
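The chunk size modification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `chunk_words`, `cosine`, and `retrieve` are hypothetical, and a simple bag-of-words cosine score stands in for a sentence-embedding model such as MatinaSRoberta, purely to keep the example self-contained.

```python
from collections import Counter
import math


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words (hypothetical helper).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks most similar to the query.

    Assumption: word counts replace real sentence embeddings here.
    """
    q = Counter(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
```

Varying `chunk_size` and `overlap` changes how much context each retrieved passage carries, which is the kind of setting the abstract reports exploring; the values that work best would depend on the corpus and the embedding model.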
Related papers
- Replication and Exploration of Generative Retrieval over Dynamic Corpora [87.09185685594105]
Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR).
We show that existing GR models with text-based docids show superior generalization to unseen documents.
We propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids.
arXiv Detail & Related papers (2025-04-24T13:01:23Z) - Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey [29.186229489968564]
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by integrating Large Language Models (LLMs) with external information retrieval.
Evaluating RAG systems presents unique challenges due to their hybrid architecture that combines retrieval and generation components.
arXiv Detail & Related papers (2025-04-21T06:39:47Z) - Building Russian Benchmark for Evaluation of Information Retrieval Models [0.0]
RusBEIR is a benchmark for evaluation of information retrieval models in the Russian language.
It integrates adapted, translated, and newly created datasets, enabling comparison of lexical and neural models.
arXiv Detail & Related papers (2025-04-17T12:11:14Z) - A Survey on Knowledge-Oriented Retrieval-Augmented Generation [45.65542434522205]
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years.
RAG combines large-scale retrieval systems with generative models.
We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge.
arXiv Detail & Related papers (2025-03-11T01:59:35Z) - PersianRAG: A Retrieval-Augmented Generation System for Persian Language [4.461903479596797]
Retrieval augmented generation (RAG) models integrate large-scale pre-trained generative models with external retrieval mechanisms.
These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation of the system.
We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets.
arXiv Detail & Related papers (2024-11-05T06:11:17Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention.
Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers.
Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z) - Assessing generalization capability of text ranking models in Polish [0.0]
Retrieval-augmented generation (RAG) is becoming an increasingly popular technique for integrating internal knowledge bases with large language models.
In this article, we focus on the reranking problem for the Polish language, examining the performance of rerankers.
The best of our models establishes a new state-of-the-art for reranking in the Polish language, outperforming existing models with up to 30 times more parameters.
arXiv Detail & Related papers (2024-02-22T06:21:41Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - The Power of Noise: Redefining Retrieval for RAG Systems [19.387105120040157]
Retrieval-Augmented Generation (RAG) has emerged as a method to extend beyond the pre-trained knowledge of Large Language Models.
We focus on the type of passages IR systems within a RAG solution should retrieve.
arXiv Detail & Related papers (2024-01-26T14:14:59Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)