PersianRAG: A Retrieval-Augmented Generation System for Persian Language
- URL: http://arxiv.org/abs/2411.02832v2
- Date: Wed, 06 Nov 2024 11:19:42 GMT
- Title: PersianRAG: A Retrieval-Augmented Generation System for Persian Language
- Authors: Hossein Hosseini, Mohammad Sobhan Zare, Amir Hossein Mohammadi, Arefeh Kazemi, Zahra Zojaji, Mohammad Ali Nematbakhsh,
- Abstract summary: Retrieval augmented generation (RAG) models integrate large-scale pre-trained generative models with external retrieval mechanisms.
These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation of the system.
We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets.
- Score: 4.461903479596797
- License:
- Abstract: Retrieval augmented generation (RAG) models, which integrate large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in various natural language processing (NLP) tasks. However, applying RAG models in Persian language as a low-resource language, poses distinct challenges. These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation of the system. In this paper, we address the challenges towards implementing a real-world RAG system for Persian language called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach using several Persian benchmark datasets. Our experimental results demonstrate the capability of the PersianRAG framework to enhance question answering task in Persian.
Related papers
- Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization [0.0]
The research aims to improve retrieval and generation accuracy by introducing Persian-specific models.
Three datasets-general knowledge(PQuad), scientifically specialized texts, and organizational reports- were used to assess these models.
MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets.
arXiv Detail & Related papers (2025-01-08T22:16:40Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.
However, Whisper struggles with unseen languages, those not included in its pre-training.
We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - QueEn: A Large Language Model for Quechua-English Translation [20.377876059048692]
We propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques.
Our approach substantially exceeds baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models.
arXiv Detail & Related papers (2024-12-06T17:04:21Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with determinantal point process (DPP)
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z) - Large Language Models for Information Retrieval: A Survey [58.30439850203101]
Information retrieval has evolved from term-based methods to its integration with advanced neural models.
Recent research has sought to leverage large language models (LLMs) to improve IR systems.
We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z) - SimOAP: Improve Coherence and Consistency in Persona-based Dialogue
Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence are great challenges for language models.
A two-stage SimOAP strategy is proposed, i.e., over-sampling and post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z) - ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual
Open-retrieval Question Answering System [16.89747171947662]
This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA)
In this challenging scenario, given an input question the system has to gather evidence documents from a multilingual pool and generate an answer in the language of the question.
We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation.
arXiv Detail & Related papers (2022-05-30T10:31:08Z) - Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.