Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
- URL: http://arxiv.org/abs/2505.19494v1
- Date: Mon, 26 May 2025 04:23:21 GMT
- Title: Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents
- Authors: Manoj Balaji Jagadeeshan, Prince Raj, Pawan Goyal
- Abstract summary: The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances. It adapts summarization techniques for Sanskrit documents to improve QA processing. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study.
- Score: 7.967320126793103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana
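The tripartite setup can be illustrated concretely. The sketch below is a minimal illustration of the three settings rather than the authors' implementation: DR scores the English query against Sanskrit documents in a shared multilingual embedding space, DT translates the documents into English and retrieves monolingually (BM25 is used here as the lexical baseline), and QT translates the query into Sanskrit before retrieval. The encoder choice, the rank_bm25 baseline, and the translate() placeholder are assumptions, not details from the paper.

```python
# A minimal sketch of the three retrieval settings, assuming a multilingual
# sentence-transformers encoder and the rank_bm25 package; translate() is a
# hypothetical placeholder for whatever MT system is plugged in.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Assumption: any multilingual encoder covering Sanskrit and English could be used here.
ENCODER = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")


def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical hook for a translation model (e.g., an NMT system or an LLM prompt)."""
    raise NotImplementedError


def direct_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """DR: score the English query against Sanskrit documents in a shared embedding space."""
    q = ENCODER.encode(query_en, convert_to_tensor=True)
    d = ENCODER.encode(docs_sa, convert_to_tensor=True)
    return util.cos_sim(q, d)[0].topk(min(k, len(docs_sa))).indices.tolist()


def document_translation_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """DT: translate Sanskrit documents into English, then retrieve monolingually (BM25 here)."""
    docs_en = [translate(doc, src="sa", tgt="en") for doc in docs_sa]
    bm25 = BM25Okapi([doc.lower().split() for doc in docs_en])
    scores = bm25.get_scores(query_en.lower().split())
    return sorted(range(len(docs_sa)), key=lambda i: scores[i], reverse=True)[:k]


def query_translation_retrieval(query_en: str, docs_sa: list[str], k: int = 5) -> list[int]:
    """QT: translate the English query into Sanskrit, then retrieve within Sanskrit."""
    query_sa = translate(query_en, src="en", tgt="sa")
    q = ENCODER.encode(query_sa, convert_to_tensor=True)
    d = ENCODER.encode(docs_sa, convert_to_tensor=True)
    return util.cos_sim(q, d)[0].topk(min(k, len(docs_sa))).indices.tolist()
```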
Related papers
- IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems [17.88837706307504]
IndicMSMarco is a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages. We build a large-scale dataset of (question, answer, relevant passage) triples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-02T12:55:51Z) - CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents [2.0277446818410994]
This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
arXiv Detail & Related papers (2025-04-22T20:55:08Z) - Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects ranging from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z) - Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi [8.21020989074456]
Despite ongoing research, there is no comprehensive benchmark for evaluating retrieval models in Hindi.
We introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval.
We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance.
arXiv Detail & Related papers (2024-08-18T10:55:04Z) - MACT: Model-Agnostic Cross-Lingual Training for Discourse Representation Structure Parsing [4.536003573070846]
We introduce a cross-lingual training strategy for semantic representation parsing models.
It exploits the alignments between languages encoded in pre-trained language models.
Experiments show significant improvements in DRS clause and graph parsing in English, German, Italian and Dutch.
arXiv Detail & Related papers (2024-06-03T07:02:57Z) - Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer [23.978072734886272]
This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages.
We construct a dataset by transferring an English dataset to Japanese.
We investigate whether the transferred dataset can assist human annotation on Japanese documents.
arXiv Detail & Related papers (2024-04-25T10:59:02Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is the task of recalling relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval [51.004601358498135]
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)