Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
- URL: http://arxiv.org/abs/2501.10175v1
- Date: Fri, 17 Jan 2025 13:17:42 GMT
- Title: Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
- Authors: Vera Pavlova
- Abstract summary: The research employs a language reduction technique to create a lightweight bilingual large language model (LLM).
Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic.
The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
- Abstract: This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic while resources in other languages, including English, are limited. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and drawing on a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method enables the bilingual Islamic neural retrieval model to outperform monolingual models on downstream retrieval tasks.
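The multi-stage recipe in the abstract can be pictured as two rounds of dual-encoder fine-tuning with in-batch negatives: first on a large general retrieval dataset such as MS MARCO, then on the smaller in-domain data. The sketch below uses the sentence-transformers library; the base model name, toy pairs, and hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Stage 0: start from a bilingual encoder; "xlm-roberta-base" is a placeholder
# for the paper's lightweight, language-reduced XLM-R checkpoint.
model = SentenceTransformer("xlm-roberta-base")

def make_loader(pairs, batch_size):
    """pairs: list of (query, relevant_passage) tuples."""
    examples = [InputExample(texts=[q, p]) for q, p in pairs]
    return DataLoader(examples, shuffle=True, batch_size=batch_size)

# In-batch negatives: every other passage in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)

# Stage 1: large general-domain retrieval data (e.g., MS MARCO query-passage pairs).
general_pairs = [("what is zakat", "Zakat is an obligatory form of almsgiving ...")]  # toy example
model.fit(train_objectives=[(make_loader(general_pairs, 64), loss)],
          epochs=1, warmup_steps=1000)

# Stage 2: smaller in-domain (Islamic) pairs, lower learning rate, more epochs.
domain_pairs = [("ruling on fasting while travelling", "A traveller may shorten ...")]  # toy example
model.fit(train_objectives=[(make_loader(domain_pairs, 32), loss)],
          epochs=3, warmup_steps=100, optimizer_params={"lr": 1e-5})

model.save("bilingual-islamic-retriever")
```

Running the in-domain stage last lets the encoder keep the general ranking ability learned from MS MARCO while adapting to domain-specific vocabulary.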
Related papers
- Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing [0.0]
This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a system with only 4GB VRAM.
We detail the process of adapting this large language model to the Arabic domain, using diverse datasets including Bactrian, OpenAssistant, and Wikipedia Arabic corpora.
Experimental results over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083.
arXiv Detail & Related papers (2024-12-23T13:08:48Z)
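For the QLoRA setup described in the entry above, a minimal sketch with transformers, bitsandbytes, and peft might look as follows; the adapter rank, target modules, and training data handling are illustrative assumptions, not values taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the frozen base weights small enough for ~4GB VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Qwen/Qwen2-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Low-rank adapters are the only trainable parameters; r and target_modules
# are illustrative choices, not the paper's reported configuration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training would then proceed with a standard Trainer/SFT loop over the Arabic
# instruction data (Bactrian, OpenAssistant, Wikipedia Arabic) mentioned above.
```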
- Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust [0.0]
This work focuses on the development of a multilingual non-profit IR system for the Islamic domain.
By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared.
arXiv Detail & Related papers (2024-11-09T11:37:18Z)
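Language reduction, as mentioned in the entry above and in the main abstract, can be read as trimming XLM-R's large multilingual vocabulary down to the tokens actually used by the target languages and slicing the embedding matrix accordingly. The sketch below shows only the embedding-selection step and is an assumption about the general technique, not the authors' exact procedure; rebuilding the SentencePiece tokenizer for the reduced vocabulary is omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Bilingual corpus sample (Arabic + English); in practice this is a large corpus.
corpus = ["Zakat purifies wealth.", "الزكاة ركن من أركان الإسلام"]

# Collect the token ids actually used, plus the tokenizer's special tokens.
keep = set(tokenizer.all_special_ids)
for text in corpus:
    keep.update(tokenizer(text, add_special_tokens=False)["input_ids"])
keep = sorted(keep)

# Slice the input embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data          # (250002, 768) for XLM-R base
new_emb = torch.nn.Embedding(len(keep), old_emb.size(1))
new_emb.weight.data = old_emb[keep].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep)

print(f"vocabulary reduced from {old_emb.size(0)} to {len(keep)} tokens")
# A matching reduced tokenizer (remapping old ids -> new ids) is still required
# before the model can be used; that step is tokenizer-specific and omitted here.
```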
- Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic [0.0]
Training retrieval models requires large amounts of data, which are difficult to obtain in-domain.
We employ a data augmentation technique that considerably improves results on the MRR@10 and NDCG@5 metrics.
The absence of an Islamic corpus and a domain-specific model for IR in English motivated us to address this lack of resources.
arXiv Detail & Related papers (2023-12-05T14:44:08Z)
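One common way to augment an in-domain retrieval dataset, which may or may not match the exact technique in the entry above, is doc2query-style query generation: a seq2seq model writes synthetic questions for each trusted passage, and each (query, passage) pair becomes an extra positive training example. The checkpoint name below is a public MS MARCO-trained generator used purely as a stand-in.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Public doc2query checkpoint used for illustration; it stands in for whatever
# generator the authors actually used for augmentation.
model_id = "doc2query/msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def synthetic_queries(passage: str, n: int = 3) -> list[str]:
    """Sample n candidate queries for a passage."""
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=384)
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                             top_k=10, num_return_sequences=n)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

passage = ("Fasting in Ramadan is obligatory for every adult Muslim who is sane, "
           "healthy, and not travelling.")
for q in synthetic_queries(passage):
    print(q)  # pair each generated query with the source passage as a training positive
```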
- QAGAN: Adversarial Approach To Learning Domain Invariant Language Features [0.76146285961466]
We explore an adversarial training approach to learning domain-invariant features.
We achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on an out-of-domain validation dataset.
arXiv Detail & Related papers (2022-06-24T17:42:18Z)
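A standard way to push an encoder toward domain-invariant features is DANN-style training with a gradient reversal layer: a domain classifier tries to tell domains apart while reversed gradients make the encoder erase that signal. This is a generic sketch of adversarial feature learning, not necessarily QAGAN's exact architecture; all layer sizes are toy values.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU())        # stand-in for a QA encoder
task_head = nn.Linear(256, 2)                                  # stand-in task classifier
domain_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

features = torch.randn(8, 768)                                 # toy batch of pooled features
task_labels = torch.randint(0, 2, (8,))
domain_labels = torch.randint(0, 2, (8,))                      # 0 = in-domain, 1 = out-of-domain

h = encoder(features)
task_loss = nn.functional.cross_entropy(task_head(h), task_labels)
# The domain classifier is trained normally, but reversed gradients reach the encoder,
# so the encoder learns features the classifier cannot use to separate domains.
domain_loss = nn.functional.cross_entropy(domain_head(grad_reverse(h)), domain_labels)
(task_loss + domain_loss).backward()
```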
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
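The datastore idea behind nearest-neighbor language models can be sketched as follows: context vectors are stored with the tokens that followed them, and at test time the parametric softmax is interpolated with a distribution read off the retrieved neighbors. FAISS stands in for the datastore here; dimensions, the kernel, and the interpolation weight are toy choices for illustration.

```python
import numpy as np
import faiss

d, vocab = 64, 1000                      # toy hidden size and vocabulary size
rng = np.random.default_rng(0)

# Datastore: one (context vector -> next token) entry per training position.
keys = rng.standard_normal((10_000, d)).astype("float32")
next_tokens = rng.integers(0, vocab, size=10_000)

index = faiss.IndexFlatL2(d)
index.add(keys)

def knn_lm_probs(context_vec, lm_probs, k=8, lam=0.25, temperature=1.0):
    """Interpolate the parametric LM distribution with a kNN distribution."""
    dists, idx = index.search(context_vec[None, :].astype("float32"), k)
    weights = np.exp(-dists[0] / temperature)
    weights /= weights.sum()
    knn_probs = np.zeros(vocab)
    for w, i in zip(weights, idx[0]):
        knn_probs[next_tokens[i]] += w
    return lam * knn_probs + (1 - lam) * lm_probs

lm_probs = np.full(vocab, 1.0 / vocab)            # stand-in for the LM softmax output
query = rng.standard_normal(d).astype("float32")  # stand-in for the current context vector
p = knn_lm_probs(query, lm_probs)
print(p.argmax(), p.sum())                        # interpolated distribution sums to 1.0
```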
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
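The semi-supervised part of such cross-lingual transfer is often plain soft-label distillation: a source-language teacher tags unlabeled target-language text and the student matches its token-level label distributions. The sketch below shows only that KL-based distillation step with toy tensors; it is a generic recipe, not the paper's reinforced iteration scheme.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level KL divergence between teacher and student label distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy shapes: 4 unlabeled target-language sentences, 16 tokens each, 9 NER tags (BIO scheme).
teacher_logits = torch.randn(4, 16, 9)          # produced by the source-language teacher
student_logits = torch.randn(4, 16, 9, requires_grad=True)

loss = distillation_loss(student_logits.view(-1, 9), teacher_logits.view(-1, 9))
loss.backward()
print(float(loss))
```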
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL matches the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms that baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
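The core linear-map idea above, expressing target-task vectors as sparse combinations of the source model's token embeddings, can be approximated with off-the-shelf sparse coding. The snippet uses scikit-learn's SparseCoder with OMP as a stand-in for the k-SVD solver named in the abstract; all sizes and data are toy values.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)

# Dictionary: embeddings of the pretrained (source) language model's tokens.
n_source_tokens, dim = 500, 128
source_embeddings = rng.standard_normal((n_source_tokens, dim))
source_embeddings /= np.linalg.norm(source_embeddings, axis=1, keepdims=True)

# Target inputs: toy feature vectors for molecular tokens (e.g., SMILES symbols).
n_target_tokens = 40
target_features = rng.standard_normal((n_target_tokens, dim))

# Each target token becomes a sparse linear combination of source embeddings.
coder = SparseCoder(dictionary=source_embeddings,
                    transform_algorithm="omp",
                    transform_n_nonzero_coefs=10)
codes = coder.transform(target_features)          # (n_target_tokens, n_source_tokens)

# Reprogrammed target embeddings now live in the source model's input space.
reprogrammed = codes @ source_embeddings
print(codes.shape, np.count_nonzero(codes, axis=1).mean())
```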
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
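Data selection for back-translation is often done by scoring each monolingual sentence with the difference between an in-domain and a general-domain language model, then keeping or up-weighting the most in-domain-looking sentences. The sketch below uses tiny add-one-smoothed unigram models as stand-ins for real LMs and is a static, Moore-Lewis-style simplification, not the curriculum strategy proposed in the entry above.

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab_size):
    """Add-one smoothed unigram log-probability, length-normalized."""
    tokens = sentence.lower().split()
    lp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return lp / max(len(tokens), 1)

in_domain = ["zakat is an obligatory charity", "fasting in ramadan is obligatory"]
general = ["the cat sat on the mat", "stock markets fell sharply today"]

in_counts = Counter(t for s in in_domain for t in s.split())
gen_counts = Counter(t for s in general for t in s.split())
vocab = len(set(in_counts) | set(gen_counts))

def score(sentence):
    """Moore-Lewis style score: higher means more in-domain-looking."""
    return (unigram_logprob(sentence, in_counts, sum(in_counts.values()), vocab)
            - unigram_logprob(sentence, gen_counts, sum(gen_counts.values()), vocab))

monolingual_pool = [
    "rules of fasting while travelling",
    "the cat chased a ball",
]
ranked = sorted(monolingual_pool, key=score, reverse=True)
print(ranked)   # in-domain-looking sentences rank first and get back-translated earlier
```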