Truth Knows No Language: Evaluating Truthfulness Beyond English
- URL: http://arxiv.org/abs/2502.09387v2
- Date: Tue, 18 Feb 2025 09:35:45 GMT
- Title: Truth Knows No Language: Evaluating Truthfulness Beyond English
- Authors: Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
- Abstract summary: We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring.
- Score: 11.20320645651082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.
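To illustrate the two automatic scoring approaches the abstract contrasts, here is a minimal Python sketch of the standard TruthfulQA multiple-choice metrics (MC1/MC2) computed from per-answer log-probabilities, together with one plausible way to compare an LLM judge against human judgments via rank correlation. The variable names and toy data are illustrative assumptions, not the paper's released code.

```python
import numpy as np
from scipy.stats import spearmanr

def mc1(logprobs_true: list[float], logprobs_false: list[float]) -> float:
    """MC1: 1.0 if the model's single most likely answer option is a true one.
    For standard MC1, logprobs_true holds the one best true answer."""
    return float(max(logprobs_true) > max(logprobs_false))

def mc2(logprobs_true: list[float], logprobs_false: list[float]) -> float:
    """MC2: normalized probability mass the model assigns to the true answers."""
    p_true = np.exp(logprobs_true).sum()
    p_false = np.exp(logprobs_false).sum()
    return p_true / (p_true + p_false)

# Hypothetical per-question scores from an LLM judge and from human annotators;
# the paper compares automatic metrics by how well they track human judgments.
judge_scores = [0.9, 0.2, 0.7, 0.4]
human_scores = [1.0, 0.0, 1.0, 0.0]
rho, _ = spearmanr(judge_scores, human_scores)
print(f"judge-human Spearman rho = {rho:.2f}")
```

Note that MC1/MC2 only need the log-probabilities the model assigns to fixed answer options, whereas LLM-as-a-Judge scores free-form generations, which is why the two approaches can track human judgments to different degrees.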
Related papers
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
The benchmark ensures comprehensive difficulty coverage, broad language diversity, and high-quality translations.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators [38.681443695708786]
This study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs.
We found that excluding the reference answer from the prompt leads to better performance across various languages.
Most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages.
arXiv Detail & Related papers (2025-03-06T12:04:29Z)
- LLM-as-a-Judge & Reward Model: What They Can and Cannot Do [2.2469442203227863]
We conduct a comprehensive analysis of automated evaluators, reporting several key findings on their behavior.
We discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, enabling evaluators trained in English to easily transfer their skills to other languages.
We find that state-of-the-art evaluators struggle with challenging prompts, in either English or Korean, underscoring their limitations in assessing or generating complex reasoning questions.
arXiv Detail & Related papers (2024-09-17T14:40:02Z)
- Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models [7.615938028813914]
We studied linguistic preference in a cross-language RAG-based information search setting.
We found that LLMs display a systematic bias towards information in the same language as the query.
arXiv Detail & Related papers (2024-07-07T21:26:36Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Selected Languages are All You Need for Cross-lingual Truthfulness Transfer [38.3269908062146]
We propose a practical method for cross-lingual truthfulness transfer called Fact-aware Multilingual Selective Synergy (FaMSS). FaMSS selects an optimal subset of the tested languages based on language bias and transfer contributions, and then applies translation instruction tuning for cross-lingual truthfulness transfer.
arXiv Detail & Related papers (2024-06-20T15:59:07Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous improvement of multilingual alignment in LLMs.
We find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) show improved alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.
We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.
Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
- LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages.
We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.
Our study confirms the significance of few-shot in-context information for enhancing LLMs' understanding of low-resource languages.
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
- Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities, yet they are mostly English-centric due to imbalanced training corpora.
We extend the evaluation to real-world user queries and non-English-centric LLMs, offering a broader examination of multilingual performance.
arXiv Detail & Related papers (2024-03-15T12:47:39Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models [2.6626950367610402]
We study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs.
We propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy (see the sketch after this list).
arXiv Detail & Related papers (2023-10-16T13:19:17Z)
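To make the RankC idea above concrete, here is a toy Python sketch that scores agreement between two languages' ranked candidate answers as the average top-j overlap. This is a simplified stand-in under assumed uniform rank weights (the published RankC metric weights higher ranks more heavily), and all names and data are illustrative.

```python
def topj_overlap(cands_a: list[str], cands_b: list[str], j: int) -> float:
    """Fraction of shared candidates among the top-j of both rankings."""
    return len(set(cands_a[:j]) & set(cands_b[:j])) / j

def rank_consistency(cands_a: list[str], cands_b: list[str]) -> float:
    """Average top-j overlap over all ranks. Uniform weights for simplicity;
    the actual RankC metric assigns larger weights to higher ranks."""
    n = min(len(cands_a), len(cands_b))
    return sum(topj_overlap(cands_a, cands_b, j) for j in range(1, n + 1)) / n

# Hypothetical ranked predictions for one factual query in two languages.
# Note the score depends only on ranking agreement, not on correctness,
# matching the paper's goal of measuring consistency independently of accuracy.
en_ranking = ["Paris", "Lyon", "Marseille"]
eu_ranking = ["Paris", "Marseille", "Lyon"]
print(f"consistency = {rank_consistency(en_ranking, eu_ranking):.2f}")
```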