CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
- URL: http://arxiv.org/abs/2504.16264v1
- Date: Tue, 22 Apr 2025 20:55:08 GMT
- Title: CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
- Authors: Francisco Valentini, Diego Kozlowski, Vincent Larivière
- Abstract summary: This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform.
- Score: 2.0277446818410994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
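The abstract mentions sparse retrieval with BM25 over machine-translated documents as an efficient baseline. The following is a minimal sketch of the BM25 ranking step only, not the authors' code: the function names `tokenize` and `bm25_rank`, the parameter defaults `k1=1.2` and `b=0.75`, and the toy documents are all assumptions for illustration. The documents are assumed to be French records that have already been translated into English by some external system, so English queries can match them lexically.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (ranking, scores): document indices by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Toy corpus: assumed English machine translations of French documents.
translated_docs = [
    "forest management policy in quebec",
    "medieval poetry and oral tradition",
    "urban housing policy in montreal",
]
ranking, scores = bm25_rank("forest policy", translated_docs)
print(ranking)
```

In this sketch the translation cost is paid once at indexing time, which is one reason a translate-then-BM25 pipeline can be cheaper at query time than a large dense encoder.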
Related papers
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z) - Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z) - Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling [32.10366004426449]
This paper introduces UMR, an Unsupervised dense Multilingual Retriever trained without any paired data.
We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers.
arXiv Detail & Related papers (2024-03-06T07:49:06Z) - Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z) - Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual
Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
arXiv Detail & Related papers (2022-09-05T17:36:14Z) - Cross-Lingual Phrase Retrieval [49.919180978902915]
Cross-lingual retrieval aims to retrieve relevant text across languages.
Current methods typically achieve cross-lingual retrieval by learning language-agnostic text representations in word or sentence level.
We propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences.
arXiv Detail & Related papers (2022-04-19T13:35:50Z) - Mind the Gap: Cross-Lingual Information Retrieval with Hierarchical
Knowledge Enhancement [28.99870384344861]
Cross-Lingual Information Retrieval aims to rank documents written in a language different from the user's query.
We introduce a multilingual knowledge graph (KG) to the CLIR task because it provides sufficient information about entities in multiple languages.
We propose a model named CLIR with hierarchical knowledge enhancement (HIKE) for our task.
arXiv Detail & Related papers (2021-12-27T04:56:30Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Cross-Lingual Document Retrieval with Smooth Learning [31.638708227607214]
Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search with different documents' languages.
arXiv Detail & Related papers (2020-11-02T03:17:39Z)