KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino
- URL: http://arxiv.org/abs/2509.06065v1
- Date: Sun, 07 Sep 2025 14:09:57 GMT
- Title: KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino
- Authors: Lorenzo Alfred Nery, Ronald Dawson Catignas, Thomas James Tiam-Lee
- Abstract summary: We present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer, which highlights the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.
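As a concrete illustration of the binary-choice framework described above, here is a minimal evaluation loop. This is a sketch under assumptions: `ask_model` is a placeholder for whatever chat-API client is under test, and the item layout (a question with one truthful and one false answer) is illustrative rather than the paper's exact data format.

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the chat model under test."""
    raise NotImplementedError

def evaluate_binary_choice(items) -> float:
    """items: dicts with 'question', 'truthful', and 'false' answer fields."""
    correct = 0
    for item in items:
        answers = [item["truthful"], item["false"]]
        random.shuffle(answers)  # randomize order to control positional bias
        labels = ["A", "B"]
        prompt = (
            item["question"]
            + "\n"
            + "\n".join(f"{l}. {a}" for l, a in zip(labels, answers))
            + "\nAnswer with A or B only."
        )
        reply = ask_model(prompt).strip().upper()
        truthful_label = labels[answers.index(item["truthful"])]
        correct += int(reply.startswith(truthful_label))
    return correct / len(items)  # accuracy over the benchmark
```

Running the same loop on the English and Filipino versions of each question gives the per-language truthfulness scores the paper compares.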
Related papers
- Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation [49.2073409243885]
Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. We conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. We identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages.
arXiv Detail & Related papers (2026-01-01T08:53:49Z)
- When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%.
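To make that metric gap concrete: macro-F1 averages per-class F1 scores, so a model that ignores rare veracity categories is penalized heavily. A toy computation with scikit-learn follows; the label set and predictions are invented for illustration, not X-Fact's actual annotations.

```python
from sklearn.metrics import f1_score

# Seven illustrative veracity categories (placeholders, not X-Fact's exact set)
labels = ["true", "false", "mostly_true", "mostly_false",
          "partly_true", "unverifiable", "other"]
y_true = ["true", "false", "partly_true", "unverifiable"]
y_pred = ["true", "mostly_true", "partly_true", "false"]

# Macro averaging weights every class equally, so missing rare classes hurts
macro_f1 = f1_score(y_true, y_pred, labels=labels,
                    average="macro", zero_division=0)
print(f"macro-F1: {macro_f1:.3f}")
```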
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
- Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks [33.2185998586144]
This study benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages. Findings show significant performance differences driven by linguistic diversity and resource availability. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance.
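For reference, 4-bit loading of the kind the study evaluates can be done with Hugging Face transformers plus bitsandbytes; this is one common recipe, not necessarily the authors' exact setup, and the model id is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-multilingual-llm"  # placeholder, not from the paper

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16 for accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)
```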
arXiv Detail & Related papers (2025-07-25T22:35:10Z)
- MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages [33.450081592217074]
We introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage.
arXiv Detail & Related papers (2025-06-24T09:53:00Z)
- Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
- How Reliable is Multilingual LLM-as-a-Judge? [11.639184489330368]
We evaluate five models from different model families across five diverse tasks involving 25 languages. We find that consistency varies significantly across languages, with particularly poor performance in low-resource languages. We propose an ensemble strategy which improves the consistency of the multilingual judge in real-world applications.
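A minimal sketch of one plausible form of such an ensemble is majority voting over several judge models; the paper's exact aggregation may differ, and the judge callables here are stand-ins.

```python
from collections import Counter
from typing import Callable, List

def ensemble_judge(judges: List[Callable[[str], str]], prompt: str) -> str:
    """Collect a verdict from each judge and return the majority verdict.

    Ties resolve to the verdict seen first, per Counter.most_common ordering.
    """
    verdicts = [judge(prompt) for judge in judges]
    return Counter(verdicts).most_common(1)[0][0]
```

Pooling judges this way smooths over the per-language inconsistency of any single judge model.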
arXiv Detail & Related papers (2025-05-18T02:32:35Z)
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- Truth Knows No Language: Evaluating Truthfulness Beyond English [11.20320645651082]
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring.
arXiv Detail & Related papers (2025-02-13T15:04:53Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show that it boosts multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
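Schematically, this kind of LLM reranking scores each candidate passage against the English query and sorts by score. In the sketch below, `score_relevance` is a placeholder for the actual prompt-and-parse call to an LLM, not the authors' implementation.

```python
from typing import List

def score_relevance(query: str, passage: str) -> float:
    """Placeholder: prompt an LLM to rate query-passage relevance (e.g. 0-10)."""
    raise NotImplementedError

def rerank(query: str, passages: List[str]) -> List[str]:
    """Order candidate passages by LLM-assigned relevance, best first."""
    return sorted(passages, key=lambda p: score_relevance(query, p),
                  reverse=True)
```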
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
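In the spirit of that method, a few-shot prompt can be assembled from translation exemplars in several high-resource languages and then applied to an unseen low-resource input; the exemplar pairs below are invented placeholders, and the real method selects exemplars more carefully and at larger scale.

```python
# Invented exemplar pairs from high-resource languages (illustrative only)
exemplars = [
    ("French", "Bonjour le monde.", "Hello world."),
    ("Spanish", "¿Cómo estás?", "How are you?"),
    ("German", "Das ist ein Test.", "This is a test."),
]

def build_prompt(source_text: str) -> str:
    """Assemble a linguistically-diverse X -> English translation prompt."""
    shots = "\n".join(f"{src} => English: {tgt}" for _, src, tgt in exemplars)
    return f"{shots}\n{source_text} => English:"

print(build_prompt("Habari ya asubuhi."))  # e.g., a Swahili input
```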
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.