Multi-EuP: The Multilingual European Parliament Dataset for Analysis of
Bias in Information Retrieval
- URL: http://arxiv.org/abs/2311.01870v1
- Date: Fri, 3 Nov 2023 12:29:11 GMT
- Authors: Jinrui Yang, Timothy Baldwin, Trevor Cohn
- Abstract summary: This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
- Score: 62.82448161570428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Multi-EuP, a new multilingual benchmark dataset comprising 22K
multilingual documents collected from the European Parliament, spanning 24
languages. The dataset is designed to investigate fairness in multilingual
information retrieval (IR), supporting analysis of both language and demographic
bias in a ranking context. It boasts an authentic multilingual corpus,
featuring topics translated into all 24 languages, as well as cross-lingual
relevance judgments. Furthermore, it offers rich demographic information
associated with its documents, facilitating the study of demographic bias. We
report the effectiveness of Multi-EuP for benchmarking both monolingual and
multilingual IR. We also conduct a preliminary experiment on language bias
caused by the choice of tokenization strategy.
Related papers
- Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation [38.631934251052485]
We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs.
High-resource languages stand out in Monolingual Knowledge Extraction.
Indo-European languages lead RALMs to provide answers directly from documents.
English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection.
arXiv Detail & Related papers (2024-10-29T11:53:19Z)
- No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [75.29561463156635]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets.
It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z)
- A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets [1.1647644386277962]
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
We propose assessing linguistic diversity of a data set against a reference language sample.
arXiv Detail & Related papers (2024-03-06T18:14:22Z)
- Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? [42.37657013017192]
We show that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction following capabilities by up to 9.9%.
We also conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
arXiv Detail & Related papers (2024-02-21T11:07:07Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.