Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
- URL: http://arxiv.org/abs/2506.03619v1
- Date: Wed, 04 Jun 2025 06:58:19 GMT
- Title: Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
- Authors: Ayuto Tsutsumi, Yuu Jinnai
- Abstract summary: This study focuses on evaluating knowledge of folktales, specifically knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions designed to probe knowledge about yokai.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address this problem, prior work has investigated both how to evaluate the cultural awareness of LLMs and how to develop culturally aware LLMs. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at https://github.com/CyberAgentAILab/YokaiEval.
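To make the evaluation setup concrete, here is a minimal sketch of how a four-option multiple-choice benchmark such as YokaiEval can be scored. It is an illustration only, not the authors' harness: the JSONL schema, the file name, and the `predict` stub are assumptions, and the actual code lives in the repository linked above.

```python
import json
import random

def load_items(path):
    """Read a JSONL file of multiple-choice items.

    Assumed schema (illustrative; the released dataset may differ):
    {"question": "...", "choices": ["...", "...", "...", "..."], "answer": 2}
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def predict(question, choices):
    """Stand-in for a model call.

    A real harness would format the question and the four options into a
    prompt, query the LLM, and parse out the chosen option index. Here we
    pick at random, which yields the ~25% chance-level baseline.
    """
    return random.randrange(len(choices))

def accuracy(items):
    correct = sum(
        predict(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

if __name__ == "__main__":
    items = load_items("yokai_eval.jsonl")  # hypothetical file name
    print(f"accuracy: {accuracy(items):.1%}")
```

Swapping `predict` for a real model call and comparing the resulting accuracy against the 25% chance level is the shape of the comparison reported across the 31 models.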
Related papers
- MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs [26.806566827956875]
MAKIEval is an automatic multilingual framework for evaluating cultural awareness in large language models. It automatically identifies cultural entities in model outputs and links them to structured knowledge; a minimal sketch of this lookup idea appears after this list. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems.
arXiv Detail & Related papers (2025-05-27T19:29:40Z)
- NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities [12.891810941315503]
This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness. We develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values.
arXiv Detail & Related papers (2025-05-23T21:18:40Z)
- CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness [28.676469530858924]
We introduce CARE, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with native judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment.
arXiv Detail & Related papers (2025-04-07T14:57:06Z)
- Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs [22.622778594671345]
We evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English. We found that training on English text can improve the scores of academic subjects in Japanese. It is unnecessary to train specifically on Japanese text to enhance abilities in Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks.
arXiv Detail & Related papers (2024-12-19T02:39:26Z)
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation [63.83457341009046]
JMMMU (Japanese MMMU) is the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation. By combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding.
arXiv Detail & Related papers (2024-10-22T17:59:56Z)
- Analyzing Cultural Representations of Emotions in LLMs through Mixed Emotion Survey [2.9213203896291766]
This study focuses on analyzing the cultural representations of emotions in Large Language Models (LLMs).
Our methodology is based on the studies of Miyamoto et al. (2010), which identified distinctive emotional indicators in Japanese and American human responses.
We find that models have limited alignment with the evidence in the literature.
arXiv Detail & Related papers (2024-08-04T20:56:05Z)
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate the factuality, relevance, and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages [39.17279399722437]
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. We introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. We construct the benchmark to include two formats of questions: short-answer and multiple-choice.
arXiv Detail & Related papers (2024-06-14T11:48:54Z)
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
- CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover the cultural perceptions of three SOTA models across 110 countries and regions on 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generations consist of linguistic "markers" that distinguish marginalized cultures from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z)
- Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge [47.57055368312541]
We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices. We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings.
arXiv Detail & Related papers (2024-04-10T08:49:27Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Model performance on all languages shows a significant deficiency compared to English, with variations reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
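As noted in the MAKIEval entry above, linking entities mentioned in model output to structured knowledge is the core of that framework. The sketch below shows the general pattern using Wikidata's public wbsearchentities endpoint; it is an illustration of the idea only, not MAKIEval's actual pipeline, and the `link_entity` helper is hypothetical.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def link_entity(surface_form, language="en"):
    """Search Wikidata for a surface form and return the top candidate.

    Illustrative only: this is not MAKIEval's pipeline, just the general
    idea of grounding an entity from model output in a knowledge base.
    """
    resp = requests.get(
        WIKIDATA_API,
        params={
            "action": "wbsearchentities",
            "search": surface_form,
            "language": language,
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json().get("search", [])
    if not hits:
        return None
    top = hits[0]
    return top["id"], top.get("description", "")

# e.g. a yokai name surfaced in a model's answer
print(link_entity("kappa"))
```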