Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
- URL: http://arxiv.org/abs/2409.15380v4
- Date: Sat, 28 Jun 2025 07:25:45 GMT
- Title: Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino
- Authors: Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi
- Abstract summary: We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation.
- Score: 8.305146753192858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to their Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly, compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.
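To make the reported accuracy numbers concrete, below is a minimal, illustrative Python sketch of how accuracy on a handcrafted prompt suite such as Kalahi might be computed, assuming each item pairs a situation with one culturally appropriate reference response and several distractors. The CulturalPrompt, evaluate_suite, and choose names are hypothetical and are not taken from the paper's actual evaluation harness.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CulturalPrompt:
    # One handcrafted item: a situation plus candidate responses.
    situation: str
    candidates: List[str]  # candidate responses: the reference answer plus distractors
    best_index: int        # index of the culturally appropriate reference response


def evaluate_suite(prompts: List[CulturalPrompt],
                   choose: Callable[[str, List[str]], int]) -> float:
    # `choose` is any model wrapper that, given a situation and its candidate
    # responses, returns the index of the response the model prefers
    # (for example, the candidate with the highest model log-likelihood).
    correct = sum(1 for p in prompts
                  if choose(p.situation, p.candidates) == p.best_index)
    return correct / len(prompts)


# Usage (hypothetical): with a 150-item suite and a model wrapper,
#   accuracy = evaluate_suite(kalahi_prompts, my_model_choose)
# The resulting accuracy can then be compared across models and against
# the native-speaker baseline (89.10% in the paper).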
Related papers
- FilBench: Can LLMs Understand and Generate Filipino? [2.029906424353094]
FilBench is a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs struggle with reading comprehension and translation. Our work demonstrates the value of curating language-specific benchmarks to aid in driving progress on Filipino NLP.
arXiv Detail & Related papers (2025-08-05T14:48:32Z) - MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark for assessing the native multilingual reasoning abilities of Large Language Models (LLMs). MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, manually translated by native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z) - Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales [2.9465623430708905]
This study focuses on evaluating knowledge of folktales, specifically knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions designed to probe knowledge about yokai.
arXiv Detail & Related papers (2025-06-04T06:58:19Z) - Disentangling Language and Culture for Evaluating Multilingual Large Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv Detail & Related papers (2025-05-30T14:25:45Z) - NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities [12.891810941315503]
This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness. We develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values.
arXiv Detail & Related papers (2025-05-23T21:18:40Z) - Batayan: A Filipino NLP benchmark for evaluating Large Language Models [0.0]
Batayan is a holistic benchmark designed to evaluate large language models (LLMs) across three key natural language processing (NLP) competencies.
Our rigorous, native-speaker-driven annotation process ensures fluency and fidelity to the complex morphological and syntactic structures of Filipino.
arXiv Detail & Related papers (2025-02-19T07:03:15Z) - CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs [75.82306181299153]
We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for assessing cultural knowledge.
We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently.
Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%.
arXiv Detail & Related papers (2024-10-03T17:04:31Z) - Evaluating Cultural Awareness of LLMs for Yoruba, Malayalam, and English [1.3359598694842185]
We explore the ability of various LLMs to comprehend the cultural aspects of two regional languages: Malayalam (state of Kerala, India) and Yoruba (West Africa).
We demonstrate that although LLMs show a high cultural similarity for English, they fail to capture the cultural nuances across the six evaluated metrics for Malayalam and Yoruba.
This has significant implications for enhancing the user experience of chat-based LLMs and for improving the validity of large-scale LLM agent-based market research.
arXiv Detail & Related papers (2024-09-14T02:21:17Z) - Cultural Value Differences of LLMs: Prompt, Language, and Model Size [35.176429953825924]
Our study aims to identify behavior patterns in the cultural values exhibited by large language models (LLMs).
The studied variants include question ordering, prompting language, and model size.
Our experiments reveal that query language and model size are the main factors resulting in cultural value differences.
arXiv Detail & Related papers (2024-06-17T12:35:33Z) - BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages [39.17279399722437]
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages.
We introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages.
We construct the benchmark to include two formats of questions: short-answer and multiple-choice.
arXiv Detail & Related papers (2024-06-14T11:48:54Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge [47.57055368312541]
We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices.
We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings.
arXiv Detail & Related papers (2024-04-10T08:49:27Z) - CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation datasets.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset compiled from users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z) - Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities, yet they remain mostly English-centric due to imbalanced training corpora.
This work extends the evaluation from NLP tasks to real user queries.
For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising.
arXiv Detail & Related papers (2024-03-15T12:47:39Z) - Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance [16.7036374022386]
We assess the impact of politeness in prompts on large language models (LLMs) across English, Chinese, and Japanese tasks.
We observed that impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes.
arXiv Detail & Related papers (2024-02-22T13:24:10Z) - Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings [73.48336898620518]
Large language models (LLMs) are highly adept at question answering and reasoning tasks.
We study the ability of a wide range of state-of-the-art multilingual LLMs to reason with proverbs and sayings in a conversational context.
arXiv Detail & Related papers (2023-09-15T17:45:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.