FIBER: A Multilingual Evaluation Resource for Factual Inference Bias
- URL: http://arxiv.org/abs/2512.11110v1
- Date: Thu, 11 Dec 2025 20:51:16 GMT
- Title: FIBER: A Multilingual Evaluation Resource for Factual Inference Bias
- Authors: Evren Ayberk Munis, Deniz Yılmaz, Arianna Muti, Çağrı Toraman,
- Abstract summary: We present FIBER, a benchmark for evaluating factual knowledge in single- and multi-entity settings.<n>The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish.<n>Using FIBER, we examine whether the prompt language induces inference bias in entity selection.
- Score: 3.128106382761961
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model's generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across different topics such that 31% of the topics exhibit factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than the single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.
Related papers
- Cross-Language Bias Examination in Large Language Models [37.21579885190632]
This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models.<n>By translating the prompts and word list into five target languages, we compare different types of bias across languages.<n>For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias.
arXiv Detail & Related papers (2025-12-17T23:22:03Z) - Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG [55.258582772528506]
We investigate whether the mixture of different document languages impacts generation and citation in unintended ways.<n>Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English.<n>We find that models sometimes trade-off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone.
arXiv Detail & Related papers (2025-09-17T12:58:18Z) - Quantifying Language Disparities in Multilingual Large Language Models [31.198046729180266]
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices.<n>We propose a framework that disentangles these confounding variables and introduces three interpretable metrics--the performance realisation ratio, its coefficient of variation, and language potential.
arXiv Detail & Related papers (2025-08-23T23:25:38Z) - Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs [50.07451351559251]
We present a study across five typologically distinct languages (English, Russian, German, Hindi, and Vietnamese)<n>We examine how position bias interacts with prompt strategies and affects output entropy.
arXiv Detail & Related papers (2025-05-22T02:23:00Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.<n>We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.<n>Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z) - SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning [44.53966523376327]
SeaEval is a benchmark for multilingual foundation models.
We characterize how these models understand and reason with natural language.
We also investigate how well they comprehend cultural practices, nuances, and values.
arXiv Detail & Related papers (2023-09-09T11:42:22Z) - Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data, created from one resource rich language, e.g., English, to other languages, less rich in terms of resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - An Empirical Study of Factors Affecting Language-Independent Models [11.976665726887733]
We show that language-independent models can be comparable to or even outperforms the models trained using monolingual data.
We experiment language-independent models with many different languages and show that they are more suitable for typologically similar languages.
arXiv Detail & Related papers (2019-12-30T22:41:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.