Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
- URL: http://arxiv.org/abs/2210.09984v1
- Date: Tue, 18 Oct 2022 16:47:18 GMT
- Title: Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
- Authors: Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin
- Abstract summary: MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: MIRACL (Multilingual Information Retrieval Across a Continuum of Languages)
is a multilingual dataset we have built for the WSDM 2023 Cup challenge that
focuses on ad hoc retrieval across 18 different languages, which collectively
encompass over three billion native speakers around the world. These languages
have diverse typologies, originate from many different language families, and
are associated with varying amounts of available resources -- including what
researchers typically characterize as high-resource as well as low-resource
languages. Our dataset is designed to support the creation and evaluation of
models for monolingual retrieval, where the queries and the corpora are in the
same language. In total, we have gathered over 700k high-quality relevance
judgments for around 77k queries over Wikipedia in these 18 languages, where
all assessments have been performed by native speakers hired by our team. Our
goal is to spur research that will improve retrieval across a continuum of
languages, thus enhancing information access capabilities for diverse
populations around the world, particularly those that have been traditionally
underserved. This overview paper describes the dataset and baselines that we
share with the community. The MIRACL website is live at http://miracl.ai/.
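MIRACL's graded relevance judgments support standard rank-based retrieval evaluation, and nDCG@10 is the metric commonly reported for the dataset. As an illustrative sketch only (not the official evaluation tooling), nDCG@k for a single query can be computed from a system ranking and a per-query judgment dictionary like this; the function and variable names here are hypothetical:

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain over the top-k results:
    # each gain is divided by log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # ranked_doc_ids: system ranking for one query (best first).
    # qrels: dict mapping doc_id -> graded relevance judgment.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking sorts the judged gains in descending order.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(gains, k) / ideal_dcg
```

Averaging this value over all queries in a language gives the per-language score; in practice, shared tooling such as `pytrec_eval` or `trec_eval` is typically used instead of a hand-rolled implementation.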
Related papers
- Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages [55.36534539177367]
This paper introduces Pangea, a multilingual multimodal large language model (MLLM) trained on a diverse 6M instruction dataset spanning 39 languages.
Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts.
We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs.
arXiv Detail & Related papers (2024-10-21T16:19:41Z)
- M2DS: Multilingual Dataset for Multi-document Summarisation [0.5071800070021028]
Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles.
However, the English-centric nature of these datasets has created a conspicuous void for multilingual datasets in today's globalised digital landscape.
This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.
arXiv Detail & Related papers (2024-07-17T06:25:51Z)
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages.
It tracks the estimated per-speaker utility and equity of language technology across all languages.
Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
- UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages [0.0]
We show how a multilingual large language model can be a resource for sentiment analysis in languages not seen during pretraining.
The languages are to various degrees related to languages used during pretraining, and the language data contain various degrees of code-switching.
We experiment with both monolingual and multilingual datasets for the final fine-tuning, and find that with the provided datasets that contain samples in the thousands, monolingual fine-tuning yields the best results.
arXiv Detail & Related papers (2023-04-27T13:51:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.