Related papers: DziriBOT: RAG based intelligent conversational agent for Algerian Arabic dialect

URL: http://arxiv.org/abs/2602.02270v1
Date: Mon, 02 Feb 2026 16:11:32 GMT
Title: DziriBOT: RAG based intelligent conversational agent for Algerian Arabic dialect
Authors: El Batoul Bechiri, Dihia Lanasri
Abstract summary: This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome the linguistic challenges of Algerian Darja. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG). Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.
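
The hybrid design described in the abstract, with structured service flows for recognized intents and retrieval-grounded answers otherwise, can be illustrated with a minimal sketch. Everything below is an assumption for illustration and is not taken from the paper: the intent names, the example utterances, the document snippets, the 0.5 confidence threshold, and the use of character n-gram cosine similarity as a stand-in for both the fine-tuned DziriBERT classifier and the retriever. Character n-grams are used only because they tolerate some spelling variation, which loosely mirrors the orthographic noise the abstract mentions.

```python
from collections import Counter
import math

# Hypothetical intent examples (Arabizi/French mix); not from the paper.
INTENT_EXAMPLES = {
    "check_balance": ["chhal rani f solde", "solde ta3i", "nchouf le solde"],
    "activate_offer": ["nhab nactivi offre", "activer offre internet"],
}

# Hypothetical snippets standing in for curated enterprise documentation.
DOCUMENTS = [
    "La carte SIM peut etre remplacee en agence avec une piece d'identite.",
    "Le roaming est disponible dans les pays partenaires apres activation.",
]


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify_intent(utterance):
    """Return (intent, score) for the most similar labeled example."""
    query = char_ngrams(utterance)
    score, intent = max(
        (cosine(query, char_ngrams(example)), name)
        for name, examples in INTENT_EXAMPLES.items()
        for example in examples
    )
    return intent, score


def retrieve(utterance):
    """Return the document snippet most similar to the utterance."""
    query = char_ngrams(utterance)
    return max(DOCUMENTS, key=lambda doc: cosine(query, char_ngrams(doc)))


def route(utterance, threshold=0.5):
    """Use a structured flow for confident intents, else fall back to retrieval."""
    intent, score = classify_intent(utterance)
    if score >= threshold:
        return ("flow", intent)
    # In a full RAG system the retrieved passage would be passed to a
    # generator model; that step is omitted from this sketch.
    return ("rag", retrieve(utterance))


if __name__ == "__main__":
    print(route("solde ta3i"))
    print(route("roaming disponible dans quels pays"))
```

The threshold controls the trade-off: a higher value sends more traffic to retrieval, a lower one keeps more queries inside scripted flows.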

Related papers

Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis [20.50741854108831]
We present Habibi, a suite of specialized and unified text-to-speech models. Our approach outperforms the leading commercial service in generation quality. We create the first systematic benchmark for multi-dialect Arabic speech synthesis.
arXiv Detail & Related papers (2026-01-20T10:02:11Z)
Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning [0.0]
We present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge.
arXiv Detail & Related papers (2025-08-12T13:02:22Z)
Enhanced Arabic Text Retrieval with Attentive Relevance Scoring [12.053940320312355]
Arabic poses a particular challenge for natural language processing and information retrieval. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. We present an enhanced Dense Passage Retrieval framework developed specifically for Arabic.
arXiv Detail & Related papers (2025-07-31T10:18:28Z)
Efficient Multilingual ASR Finetuning via LoRA Language Experts [59.27778147311189]
This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios.
arXiv Detail & Related papers (2025-06-11T07:06:27Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
DarijaBanking: A New Resource for Overcoming Language Barriers in Banking Intent Detection for Moroccan Arabic Speakers [5.274804664403783]
Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems. This paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain. DarijaBanking comprises over 1,800 parallel high-quality queries in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes.
arXiv Detail & Related papers (2024-05-26T08:33:28Z)
Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language. The proposed framework can explain specific predictions by training a local surrogate explainable model. We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding [65.34601470417967]
We describe a hybrid architecture for dialogue response generation that combines the strengths of neural language modeling and rule-based generation. Our experiments show that this system outperforms both rule-based and learned approaches in human evaluations of fluency, relevance, and truthfulness.
arXiv Detail & Related papers (2022-09-16T09:00:49Z)
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning [50.40151403246205]
Huge language models (LMs) have ushered in a new era for AI, serving as a gateway to natural-language-based knowledge tasks. We define a flexible architecture with multiple neural models, complemented by discrete knowledge and reasoning modules. We describe this neuro-symbolic architecture, dubbed the Modular Reasoning, Knowledge and Language (MRKL) system.
arXiv Detail & Related papers (2022-05-01T11:01:28Z)
Gaussian Smoothen Semantic Features (GSSF) -- Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework [9.89901717499058]
In this work, we introduce Gaussian Smoothen Semantic Features (GSSF) for better semantic selection in Indian regional language-based image captioning. We also introduce a procedure that uses existing translations and English crowd-sourced sentences for training. The main contribution of this work is the development of deep learning architectures for the Bengali language.
arXiv Detail & Related papers (2020-02-16T23:03:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.