Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
- URL: http://arxiv.org/abs/2506.06339v1
- Date: Sun, 01 Jun 2025 00:04:58 GMT
- Title: Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
- Authors: Jumana Alsubhi, Mohammad D. Alahmadi, Ahmed Alhusayni, Ibrahim Aldailami, Israa Hamdine, Ahmad Shabana, Yazeed Iskandar, Suhayb Khayyat
- Abstract summary: Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components, including chunking strategies, embedding models, rerankers, and language models, across a diverse set of Arabic datasets.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. While several studies have investigated RAG pipelines for high-resource languages, the optimization of RAG components for Arabic remains underexplored. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components, including chunking strategies, embedding models, rerankers, and language models, across a diverse set of Arabic datasets. Using the RAGAS framework, we systematically compare performance across four core metrics: context precision, context recall, answer faithfulness, and answer relevancy. Our experiments demonstrate that sentence-aware chunking outperforms all other segmentation methods, while BGE-M3 and Multilingual-E5-large emerge as the most effective embedding models. The inclusion of a reranker (bge-reranker-v2-m3) significantly boosts faithfulness in complex datasets, and Aya-8B surpasses StableLM in generation quality. These findings provide critical insights for building high-quality Arabic RAG pipelines and offer practical guidelines for selecting optimal components across different document types.
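The best-performing stack reported in the abstract (sentence-aware chunking, BGE-M3 embeddings, and the bge-reranker-v2-m3 reranker) maps onto a standard retrieve-then-rerank pipeline. Below is a minimal sketch of such a pipeline, not the authors' code: the model checkpoints come from the abstract, but the chunking heuristic, the sentence-transformers loading path, and all function names and parameters are illustrative assumptions.

```python
# A hedged sketch of an Arabic retrieve-then-rerank pipeline: sentence-aware
# chunking, BGE-M3 dense embeddings, bge-reranker-v2-m3 cross-encoder reranking.
import re

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer


def sentence_aware_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks, never splitting mid-sentence.

    The paper's exact chunker is unspecified; this regex heuristic splits on
    Arabic and Latin sentence-final punctuation as an assumption.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!؟؛])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


embedder = SentenceTransformer("BAAI/bge-m3")           # embedding model from the paper
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")      # reranker from the paper


def retrieve(query: str, chunks: list[str], k_dense: int = 20, k_final: int = 5) -> list[str]:
    """Dense retrieval with BGE-M3, then cross-encoder reranking."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec  # cosine similarity (vectors are normalized)
    candidates = [chunks[i] for i in np.argsort(-scores)[:k_dense]]
    rerank_scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(-np.asarray(rerank_scores))[:k_final]
    return [candidates[i] for i in order]
```

The evaluation step uses the RAGAS framework over the four named metrics. A sketch of that scoring call, assuming the ragas 0.1.x API and a judge LLM configured for ragas (by default an OpenAI model via an API key); the sample rows are invented placeholders:

```python
# Scoring a RAG pipeline with RAGAS on the paper's four metrics (sketch).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

eval_data = Dataset.from_dict({
    "question": ["ما هي عاصمة المملكة العربية السعودية؟"],
    "contexts": [["الرياض هي عاصمة المملكة العربية السعودية وأكبر مدنها."]],
    "answer": ["عاصمة المملكة العربية السعودية هي الرياض."],
    "ground_truth": ["الرياض"],
})
report = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(report)  # per-metric scores, e.g. {'context_precision': ..., 'faithfulness': ...}
```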
Related papers
- CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language [0.5937476291232802]
Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. This research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset.
arXiv Detail & Related papers (2025-05-25T07:42:32Z) - Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes [0.0]
Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance.
arXiv Detail & Related papers (2025-05-07T05:04:30Z) - Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models [2.9687381456164004]
This paper proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge-scheduling efficiency. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion.
arXiv Detail & Related papers (2025-04-28T02:50:45Z) - AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset [95.45316956434608]
Preference learning is critical for aligning large language models with human values. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization.
arXiv Detail & Related papers (2025-04-04T17:33:07Z) - Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization [0.0]
The research aims to improve retrieval and generation accuracy by introducing Persian-specific models. Three datasets were used to assess these models: general knowledge (PQuad), scientifically specialized texts, and organizational reports. MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets.
arXiv Detail & Related papers (2025-01-08T22:16:40Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction-tuning data to improve long-context capability. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation [8.377398103067508]
We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases.
RAG Foundry integrates data creation, training, inference, and evaluation into a single workflow.
We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations.
arXiv Detail & Related papers (2024-08-05T15:16:24Z) - FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention. Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers. Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z) - CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence remain major challenges for language models.
A two-stage SimOAP strategy is proposed: over-sampling followed by post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z)