No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding
- URL: http://arxiv.org/abs/2602.03709v1
- Date: Tue, 03 Feb 2026 16:32:00 GMT
- Title: No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding
- Authors: Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras,
- Abstract summary: We introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models. We present a new framework that transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types.
- Score: 10.749595729794692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.
Related papers
- LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction [57.23766971626989]
Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data. We present an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG). We find that the cultural knowledge graphs are better realized in English, even when the target culture is non-English.
arXiv Detail & Related papers (2026-01-25T20:05:04Z)
- Do Large Language Models Truly Understand Cross-cultural Differences? [53.481048019144644]
We develop a scenario-based benchmark to evaluate large language models' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. The dataset supports continuous expansion, and experiments confirm its transferability to other languages.
arXiv Detail & Related papers (2025-12-08T01:21:58Z)
- MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z)
- CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification. Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
- Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation [2.0467354053171243]
We present the first comprehensive evaluation of the cultural competence of Vision-Language Models (VLMs) through multimodal story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally specific vocabulary spanning names, familial terms, and geographic markers. We uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments.
arXiv Detail & Related papers (2025-08-22T19:39:02Z)
- Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition [0.0]
We introduce Nunchi-Bench, a benchmark designed to evaluate large language models' cultural understanding. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts.
arXiv Detail & Related papers (2025-07-05T11:52:09Z)
- TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs [13.069833806549914]
We propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench) for assessing the understanding of traditional Chinese culture. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage.
arXiv Detail & Related papers (2025-05-16T14:10:41Z)
- Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions [9.331687165284587]
We evaluate the capacity of Large Language Models to recognize and accurately respond to the Little Traditions within Indian society. Through a series of case studies, we assess whether LLMs can balance the interplay between dominant Great Traditions and localized Little Traditions. Our findings reveal that while LLMs demonstrate an ability to articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios.
arXiv Detail & Related papers (2025-01-28T06:58:25Z)
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding. This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.