Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic
IR in English and Arabic
- URL: http://arxiv.org/abs/2312.02803v1
- Date: Tue, 5 Dec 2023 14:44:08 GMT
- Title: Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic
IR in English and Arabic
- Authors: Vera Pavlova
- Abstract summary: Training retrieval models requires a lot of data, which is difficult to obtain in-domain.
We employ a data augmentation technique, which considerably improves results on the MRR@10 and NDCG@5 metrics.
The absence of an Islamic corpus and a domain-specific model for the IR task in English motivated us to address this lack of resources.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we approach the problem of Qur'anic information retrieval (IR)
in Arabic and English. Using the latest state-of-the-art methods in neural IR,
we investigate what helps to tackle this task more effectively. Training
retrieval models requires a lot of data, which is difficult to obtain
in-domain. Therefore, we commence with training on a large amount of
general-domain data and then continue training on in-domain data. To handle the
lack of in-domain data, we employed a data augmentation technique, which
considerably improved results on the MRR@10 and NDCG@5 metrics, setting the
state of the art in Qur'anic IR for both English and Arabic. The absence of an
Islamic corpus and a domain-specific model for the IR task in English motivated
us to address this lack of resources and take preliminary steps toward
compiling an Islamic corpus and pre-training a domain-specific language model
(LM), which helped to improve the performance of the retrieval models that use
the domain-specific LM as their shared backbone. We examined several language
models (LMs) in Arabic to select one that deals efficiently with the Qur'anic
IR task. Besides transferring successful experiments from English to Arabic, we
conducted additional experiments with the retrieval task in Arabic to mitigate
the scarcity of general-domain datasets used to train the retrieval models.
Handling the Qur'anic IR task in both English and Arabic allowed us to enhance
the comparison and share valuable insights across models and languages.
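
The abstract reports gains on MRR@10 and NDCG@5. For reference, the sketch
below computes both metrics in plain Python over hypothetical ranked lists of
verse IDs; the toy data, variable names, and binary-relevance assumption are
illustrative and not taken from the paper.

```python
import math

def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean Reciprocal Rank@k: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked_lists, relevant_sets, k=5):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        dcg = sum(1.0 / math.log2(rank + 1)
                  for rank, doc_id in enumerate(ranking[:k], start=1)
                  if doc_id in relevant)
        ideal = sum(1.0 / math.log2(rank + 1)
                    for rank in range(1, min(len(relevant), k) + 1))
        total += dcg / ideal if ideal > 0 else 0.0
    return total / len(ranked_lists)

# Hypothetical toy data: per-query ranked verse IDs and gold relevant sets.
rankings = [["v3", "v7", "v1"], ["v2", "v9", "v4"]]
gold = [{"v7"}, {"v4", "v2"}]
print(f"MRR@10 = {mrr_at_k(rankings, gold):.3f}")   # 0.750
print(f"NDCG@5 = {ndcg_at_k(rankings, gold):.3f}")  # 0.775
```

Both metrics reward placing relevant passages near the top of the ranking,
which is why they are the standard choices for passage-retrieval evaluation.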
Related papers
- Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval [0.0]
The research employs a language reduction technique to create a lightweight bilingual large language model (LLM).
Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic.
The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
arXiv Detail & Related papers (2025-01-17T13:17:42Z)
- Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing [0.0]
This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a system with only 4GB VRAM.
We detail the process of adapting this large language model to the Arabic domain, using diverse datasets including Bactrian, OpenAssistant, and Wikipedia Arabic corpora.
Experimental results over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083.
arXiv Detail & Related papers (2024-12-23T13:08:48Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer, which could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust [0.0]
This work focuses on the development of a multilingual non-profit IR system for the Islamic domain.
By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared.
arXiv Detail & Related papers (2024-11-09T11:37:18Z)
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP).
To overcome this limitation, we create a dedicated data set from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
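
The main paper and several related papers above (e.g., the multi-stage
bilingual Islamic LLM and the multilingual Islamic IR system) share a common
domain adaptation step: continued pre-training of a language model on
in-domain text. The following is a minimal, generic sketch of that recipe with
Hugging Face transformers; the mBERT checkpoint, the islamic_corpus.txt file,
and all hyperparameters are placeholders, not the exact setup of any paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Any BERT-style checkpoint works as a starting point; mBERT is a stand-in.
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical in-domain corpus: one passage of Islamic text per line.
corpus = load_dataset("text", data_files={"train": "islamic_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Standard masked-language-modeling objective for continued pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="domain-adapted-lm",
    per_device_train_batch_size=16,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```

The adapted checkpoint can then serve as the shared backbone of a dense
retriever, which is the role the domain-specific LM plays in the main paper.

A second recurring technique, described in the bilingual adaptation and
progressive vocabulary expansion papers above, is extending a monolingual
model's vocabulary with Arabic tokens and first training only the embedding
matrix. The sketch below shows stage 1 of that idea; GPT-2 and the two sample
tokens are illustrative stand-ins, not the models or vocabularies the papers use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # stand-in for a monolingual English foundation model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Hypothetical Arabic-specific tokens mined from an Arabic corpus.
num_added = tokenizer.add_tokens(["السلام", "عليكم"])
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

# Stage 1: train only the embedding matrix; everything else stays frozen.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
# (GPT-2 ties the output head to the input embeddings, so it trains too.)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"added {num_added} tokens; {trainable} trainable parameters in stage 1")
```

Stage 2 would then unfreeze the full model and continue pre-training on a mix
of Arabic and English text so the model gains Arabic without losing English.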
This list is automatically generated from the titles and abstracts of the papers on this site.