Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
- URL: http://arxiv.org/abs/2506.06339v1
- Date: Sun, 01 Jun 2025 00:04:58 GMT
- Title: Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
- Authors: Jumana Alsubhi, Mohammad D. Alahmadi, Ahmed Alhusayni, Ibrahim Aldailami, Israa Hamdine, Ahmad Shabana, Yazeed Iskandar, Suhayb Khayyat
- Abstract summary: Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components, including chunking strategies, embedding models, rerankers, and language models, across a diverse set of Arabic datasets.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. While several studies have investigated RAG pipelines for high-resource languages, the optimization of RAG components for Arabic remains underexplored. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components, including chunking strategies, embedding models, rerankers, and language models, across a diverse set of Arabic datasets. Using the RAGAS framework, we systematically compare performance across four core metrics: context precision, context recall, answer faithfulness, and answer relevancy. Our experiments demonstrate that sentence-aware chunking outperforms all other segmentation methods, while BGE-M3 and Multilingual-E5-large emerge as the most effective embedding models. The inclusion of a reranker (bge-reranker-v2-m3) significantly boosts faithfulness in complex datasets, and Aya-8B surpasses StableLM in generation quality. These findings provide critical insights for building high-quality Arabic RAG pipelines and offer practical guidelines for selecting optimal components across different document types.
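The best-performing stack reported in the abstract (sentence-aware chunking, BGE-M3 embeddings, and the bge-reranker-v2-m3 reranker) maps onto a standard retrieve-then-rerank pipeline. Below is a minimal sketch of such a pipeline, not the authors' code: the model checkpoints come from the abstract, but the chunking heuristic, the sentence-transformers loading path, and all function names and parameters are illustrative assumptions.

```python
# A hedged sketch of an Arabic retrieve-then-rerank pipeline: sentence-aware
# chunking, BGE-M3 dense embeddings, bge-reranker-v2-m3 cross-encoder reranking.
import re

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer


def sentence_aware_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks, never splitting mid-sentence.

    The paper's exact chunker is unspecified; this regex heuristic splits on
    Arabic and Latin sentence-final punctuation as an assumption.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!؟؛])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


embedder = SentenceTransformer("BAAI/bge-m3")           # embedding model from the paper
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")      # reranker from the paper


def retrieve(query: str, chunks: list[str], k_dense: int = 20, k_final: int = 5) -> list[str]:
    """Dense retrieval with BGE-M3, then cross-encoder reranking."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec  # cosine similarity (vectors are normalized)
    candidates = [chunks[i] for i in np.argsort(-scores)[:k_dense]]
    rerank_scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(-np.asarray(rerank_scores))[:k_final]
    return [candidates[i] for i in order]
```

The evaluation step uses the RAGAS framework over the four named metrics. A sketch of that scoring call, assuming the ragas 0.1.x API and a judge LLM configured for ragas (by default an OpenAI model via an API key); the sample rows are invented placeholders:

```python
# Scoring a RAG pipeline with RAGAS on the paper's four metrics (sketch).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

eval_data = Dataset.from_dict({
    "question": ["ما هي عاصمة المملكة العربية السعودية؟"],
    "contexts": [["الرياض هي عاصمة المملكة العربية السعودية وأكبر مدنها."]],
    "answer": ["عاصمة المملكة العربية السعودية هي الرياض."],
    "ground_truth": ["الرياض"],
})
report = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(report)  # per-metric scores, e.g. {'context_precision': ..., 'faithfulness': ...}
```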
Related papers
- CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language [0.5937476291232802]
Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. This research proposes CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset.
arXiv Detail & Related papers (2025-05-25T07:42:32Z) - Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes [0.0]
Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance.
arXiv Detail & Related papers (2025-05-07T05:04:30Z) - Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models [2.9687381456164004]
This paper proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge-scheduling efficiency. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion.
arXiv Detail & Related papers (2025-04-28T02:50:45Z) - AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset [95.45316956434608]
Preference learning is critical for aligning large language models with human values. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization.
arXiv Detail & Related papers (2025-04-04T17:33:07Z) - Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization [0.0]
The research aims to improve retrieval and generation accuracy by introducing Persian-specific models. Three datasets were used to assess these models: general knowledge (PQuad), scientifically specialized texts, and organizational reports. MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets.
arXiv Detail & Related papers (2025-01-08T22:16:40Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction-tuning data to improve long-context capability. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation [8.377398103067508]
We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases.
RAG Foundry integrates data creation, training, inference, and evaluation into a single workflow.
We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations.
arXiv Detail & Related papers (2024-08-05T15:16:24Z) - FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention. Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers. Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z) - CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models [49.16989035566899]
Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources.
This paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios.
arXiv Detail & Related papers (2024-01-30T14:25:32Z) - Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z) - SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation [54.66399120084227]
Language models trained on large-scale corpora can generate remarkably fluent results in open-domain dialogue.
For the persona-based dialogue generation task, consistency and coherence remain major challenges for language models.
A two-stage SimOAP strategy is proposed: over-sampling followed by post-evaluation.
arXiv Detail & Related papers (2023-05-18T17:23:00Z)