Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
- URL: http://arxiv.org/abs/2505.19494v1
- Date: Mon, 26 May 2025 04:23:21 GMT
- Title: Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
- Authors: Manoj Balaji Jagadeeshan, Prince Raj, Pawan Goyal
- Abstract summary: The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances. It adapts summarization techniques for Sanskrit documents to improve QA processing. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study.
- Score: 7.967320126793103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana
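The tripartite setup can be illustrated concretely. The sketch below is a minimal illustration of the three settings rather than the authors' implementation: DR scores the English query against Sanskrit documents in a shared multilingual embedding space, DT translates the documents into English and retrieves monolingually (BM25 is used here as the lexical baseline), and QT translates the query into Sanskrit before retrieval. The encoder choice, the rank_bm25 baseline, and the translate() placeholder are assumptions, not details from the paper.

```python
# A minimal sketch of the three retrieval settings, assuming a multilingual
# sentence-transformers encoder and the rank_bm25 package; translate() is a
# hypothetical placeholder for whatever MT system is plugged in.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Assumption: any multilingual encoder covering Sanskrit and English could be used here.
ENCODER = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")


def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical hook for a translation model (e.g., an NMT system or an LLM prompt)."""
    raise NotImplementedError


def direct_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """DR: score the English query against Sanskrit documents in a shared embedding space."""
    q = ENCODER.encode(query_en, convert_to_tensor=True)
    d = ENCODER.encode(docs_sa, convert_to_tensor=True)
    return util.cos_sim(q, d)[0].topk(min(k, len(docs_sa))).indices.tolist()


def document_translation_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """DT: translate Sanskrit documents into English, then retrieve monolingually (BM25 here)."""
    docs_en = [translate(doc, src="sa", tgt="en") for doc in docs_sa]
    bm25 = BM25Okapi([doc.lower().split() for doc in docs_en])
    scores = bm25.get_scores(query_en.lower().split())
    return sorted(range(len(docs_sa)), key=lambda i: scores[i], reverse=True)[:k]


def query_translation_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """QT: translate the English query into Sanskrit, then retrieve within Sanskrit."""
    query_sa = translate(query_en, src="en", tgt="sa")
    q = ENCODER.encode(query_sa, convert_to_tensor=True)
    d = ENCODER.encode(docs_sa, convert_to_tensor=True)
    return util.cos_sim(q, d)[0].topk(min(k, len(docs_sa))).indices.tolist()
```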
Related papers
- IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems [17.88837706307504]
IndicMSMarco is a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages. We build a large-scale dataset of (question, answer, relevant passage) triples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-02T12:55:51Z) - CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents [2.0277446818410994]
This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
arXiv Detail & Related papers (2025-04-22T20:55:08Z) - Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects ranging from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z) - Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi [8.21020989074456]
Despite ongoing research, there is no comprehensive benchmark for evaluating retrieval models in Hindi.
We introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval.
We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance.
arXiv Detail & Related papers (2024-08-18T10:55:04Z) - MACT: Model-Agnostic Cross-Lingual Training for Discourse Representation Structure Parsing [4.536003573070846]
We introduce a cross-lingual training strategy for semantic representation parsing models.
It exploits the alignments between languages encoded in pre-trained language models.
Experiments show significant improvements in DRS clause and graph parsing in English, German, Italian and Dutch.
arXiv Detail & Related papers (2024-06-03T07:02:57Z) - Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer [23.978072734886272]
This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages.
We construct a dataset by transferring an English dataset to Japanese.
We investigate whether the transferred dataset can assist human annotation on Japanese documents.
arXiv Detail & Related papers (2024-04-25T10:59:02Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is the task of recalling relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)