Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in
Foundation Models
- URL: http://arxiv.org/abs/2305.13675v2
- Date: Wed, 6 Dec 2023 01:54:34 GMT
- Title: Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in
Foundation Models
- Authors: Tim Schott, Daniel Furman, and Shreshta Bhat
- Abstract summary: We assess the ability of foundation models to recall encyclopedic knowledge across a wide range of linguistic contexts.
We produce a 20-language dataset that contains 303k factual associations paired with counterfactuals.
We evaluate 5 models in a multilingual test, and benchmark a diverse set of 24 models in an English-only test.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we assess the ability of foundation models to recall
encyclopedic knowledge across a wide range of linguistic contexts. To support
this, we: 1) produce a 20-language dataset that contains 303k factual
associations paired with counterfactuals, 2) evaluate 5 models in a
multilingual test, and 3) benchmark a diverse set of 24 models in an
English-only test. Meta's LLaMA achieves the highest scores in both
multilingual and English-only evaluations. Yet, an analysis of LLaMA's errors
reveals significant limitations in its ability to recall facts in languages
other than English, plus difficulties related to the location and gender of
fact subjects. Overall, our findings suggest that today's foundation models are
far from polyglots.
Related papers
- Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages:A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish [12.286855282078305]
GPT-4o, GPT-4, Claude3.5Sonnet, LLaMA3.1, MistralLarge2, LLaMA-2Chat13B, and Mistral7BInstruct are evaluated.<n>Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue.
arXiv Detail & Related papers (2025-11-05T22:09:53Z) - Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages [0.0]
Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis.<n>We fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets.<n>We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases.
arXiv Detail & Related papers (2025-08-08T16:23:24Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs)<n>We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.<n>Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [60.52580061637301]
MMLU-ProX is a comprehensive benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language.
We evaluate 25 state-of-the-art large language models (LLMs) using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries.
Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili.
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts.
We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z) - MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z) - Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ [16.637598165238934]
Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers.
Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages.
We introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions.
arXiv Detail & Related papers (2024-03-06T16:01:44Z) - Cross-Lingual Consistency of Factual Knowledge in Multilingual Language
Models [2.6626950367610402]
We study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs.
We propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy.
arXiv Detail & Related papers (2023-10-16T13:19:17Z) - Cross-Lingual Knowledge Editing in Large Language Models [73.12622532088564]
Knowledge editing has been shown to adapt large language models to new knowledge without retraining from scratch.
It is still unknown the effect of source language editing on a different target language.
We first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese.
arXiv Detail & Related papers (2023-09-16T11:07:52Z) - SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning [44.53966523376327]
SeaEval is a benchmark for multilingual foundation models.
We characterize how these models understand and reason with natural language.
We also investigate how well they comprehend cultural practices, nuances, and values.
arXiv Detail & Related papers (2023-09-09T11:42:22Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLMs) trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently show the best performance.
Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.