Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust
- URL: http://arxiv.org/abs/2411.06151v1
- Date: Sat, 09 Nov 2024 11:37:18 GMT
- Title: Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust
- Authors: Vera Pavlova, Mohammed Makhlouf,
- Abstract summary: This work focuses on the development of a multilingual non-profit IR system for the Islamic domain.
By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared.
- Score: 0.0
- License:
- Abstract: The widespread use of large language models (LLMs) has dramatically improved many applications of Natural Language Processing (NLP), including Information Retrieval (IR). However, domains that are not driven by commercial interest often lag behind in benefiting from AI-powered solutions. One such area is religious and heritage corpora. Alongside similar domains, Islamic literature holds significant cultural value and is regularly utilized by scholars and the general public. Navigating this extensive amount of text is challenging, and there is currently no unified resource that allows for easy searching of this data using advanced AI tools. This work focuses on the development of a multilingual non-profit IR system for the Islamic domain. This process brings a few major challenges, such as preparing multilingual domain-specific corpora when data is limited in certain languages, deploying a model on resource-constrained devices, and enabling fast search on a limited budget. By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a lightweight multilingual retrieval model was prepared, demonstrating superior performance compared to larger models pre-trained on general domain data. Furthermore, evaluating the proposed architecture that utilizes Rust Language capabilities shows the possibility of implementing efficient semantic search in a low-resource setting.
Related papers
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval [0.0]
The research employs a language reduction technique to create a lightweight bilingual large language model (LLM)
Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic.
The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
arXiv Detail & Related papers (2025-01-17T13:17:42Z) - Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z) - ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot
Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - How Low is Too Low? A Computational Perspective on Extremely
Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to any other language to obtain an interpretable execution.
arXiv Detail & Related papers (2021-05-30T12:09:59Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.