Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2511.01649v1
- Date: Mon, 03 Nov 2025 15:04:23 GMT
- Title: Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation
- Authors: Hung-Shin Lee, Chen-Chi Chang, Ching-Yuan Chen, Yun-Hsiang Hsu
- Abstract summary: This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains.
- Score: 3.141716989847573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures the semantic accuracy and cultural relevance of LLM-generated responses.
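The framework as summarized suggests a simple evaluation loop: tag each test question with one of the six Bloom levels, retrieve supporting passages from the cultural archive, generate an answer, and score it against a reference. The sketch below illustrates that loop under stated assumptions: the archive snippets, the question set, the `llm_generate` stub, and the TF-IDF cosine used as a semantic-accuracy proxy are all placeholders, not the authors' implementation (an embedding model would be a closer match for "semantic accuracy").

```python
# Minimal sketch of a Bloom's-Taxonomy RAG evaluation loop.
# Assumptions: archive snippets, questions, and scoring are illustrative,
# not the paper's actual data or method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

BLOOM_LEVELS = ["Remembering", "Understanding", "Applying",
                "Analyzing", "Evaluating", "Creating"]

# Hypothetical stand-ins for the Hakka cultural archive and test items.
archive = [
    "The Hakka tung blossom festival is held each spring in Taiwan.",
    "Lei cha (pounded tea) is a traditional Hakka beverage.",
]
questions = [
    {"level": "Remembering",
     "question": "What is lei cha?",
     "reference": "Lei cha is a traditional Hakka pounded tea."},
]

vectorizer = TfidfVectorizer().fit(archive)
doc_vecs = vectorizer.transform(archive)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k archive passages most similar to the question."""
    sims = cosine_similarity(vectorizer.transform([question]), doc_vecs)[0]
    return [archive[i] for i in sims.argsort()[::-1][:k]]

def llm_generate(prompt: str) -> str:
    """Placeholder: swap in a real LLM call (e.g., an API client)."""
    return "Lei cha is a traditional Hakka pounded tea."

def semantic_accuracy(answer: str, reference: str) -> float:
    """Crude proxy: TF-IDF cosine similarity to the reference answer."""
    vecs = vectorizer.transform([answer, reference])
    return float(cosine_similarity(vecs[0], vecs[1])[0][0])

for item in questions:
    assert item["level"] in BLOOM_LEVELS
    context = "\n".join(retrieve(item["question"]))
    prompt = f"Context:\n{context}\n\nQuestion: {item['question']}"
    answer = llm_generate(prompt)
    print(f"[{item['level']}] score={semantic_accuracy(answer, item['reference']):.2f}")
```

Running the same loop with the context line omitted gives the closed-book baseline the RAG condition is compared against.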
Related papers
- Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps [3.689494816536669]
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets.
arXiv Detail & Related papers (2025-10-15T11:25:33Z) - MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
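Of the five dimensions, cross-lingual consistency (ii) lends itself to a concrete illustration: ask the model the same culturally grounded question in each language, map the answers to a shared label space, and measure pairwise agreement. A minimal sketch follows; the data and the exact-match agreement metric are assumptions, not MMA-ASIA's actual protocol.

```python
# Rough sketch of a cross-lingual consistency check (dimension ii above).
# The answers and the exact-match metric are illustrative assumptions.
from itertools import combinations

# Hypothetical answers to one question asked in different languages,
# mapped back to a shared label space (e.g., via answer options).
answers_by_language = {
    "zh": "Chuseok",   # model's choice when asked in Chinese
    "ko": "Chuseok",   # ... in Korean
    "ja": "Obon",      # ... in Japanese
}

def cross_lingual_consistency(answers: dict[str, str]) -> float:
    """Fraction of language pairs whose answers agree."""
    pairs = list(combinations(answers.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

print(f"consistency = {cross_lingual_consistency(answers_by_language):.2f}")
# -> 0.33: only the zh/ko pair agrees out of three pairs.
```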
arXiv Detail & Related papers (2025-10-07T14:12:12Z) - Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens [9.000522371422628]
We introduce a four-part framework that categorizes how benchmarks frame culture. We qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks.
arXiv Detail & Related papers (2025-10-07T13:42:44Z) - Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment [38.24188183584244]
Reward models (RMs) are crucial for aligning large language models with diverse cultures. Existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant datasets. We propose the Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains.
arXiv Detail & Related papers (2025-09-26T02:56:06Z) - CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification. Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z) - Disentangling Language and Culture for Evaluating Multilingual Large Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
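The decomposition is easy to picture as a grid: evaluate each (question language, cultural context) cell separately, so a score drop along one axis can be attributed to language rather than culture, or vice versa. A toy sketch under assumed data follows; the framework's actual task design is richer than this.

```python
# Sketch of the dual-axis idea: score every (question language,
# cultural context) cell separately to disentangle the two effects.
# The languages, cultures, and scores are hypothetical placeholders.
languages = ["en", "zh"]
cultures = ["US", "Taiwan"]

def evaluate(language: str, culture: str) -> float:
    """Placeholder: run the benchmark slice and return accuracy."""
    fake = {("en", "US"): 0.91, ("en", "Taiwan"): 0.74,
            ("zh", "US"): 0.83, ("zh", "Taiwan"): 0.79}
    return fake[(language, culture)]

# A drop concentrated in one column (e.g., questions about Taiwanese
# culture in both languages) points to a cultural gap; a drop along a
# row points to a linguistic one.
for lang in languages:
    row = "  ".join(f"{c}={evaluate(lang, c):.2f}" for c in cultures)
    print(f"{lang}: {row}")
```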
arXiv Detail & Related papers (2025-05-30T14:25:45Z) - Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task. We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z) - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z) - Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z) - Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture [4.467334566487944]
This study introduces a benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge.
The study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating.
The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge.
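One way to arrive at such per-domain comparisons is to tag every scored item with its cognitive domain and evaluation condition (RAG vs. closed-book) and roll the scores up. A minimal aggregation sketch follows, with hypothetical item records standing in for real results.

```python
# Minimal rollup sketch: per-item scores tagged with a cognitive domain
# and a condition, aggregated into the per-domain comparison described
# above. The records are illustrative, not the paper's results.
from collections import defaultdict

records = [  # (domain, condition, score) -- hypothetical item scores
    ("Remembering", "rag", 0.92), ("Remembering", "closed", 0.55),
    ("Creating", "rag", 0.64), ("Creating", "closed", 0.40),
]

totals: dict[tuple[str, str], list[float]] = defaultdict(list)
for domain, condition, score in records:
    totals[(domain, condition)].append(score)

for (domain, condition), scores in sorted(totals.items()):
    print(f"{domain:12s} {condition:6s} mean={sum(scores)/len(scores):.2f}")
```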
arXiv Detail & Related papers (2024-09-03T02:50:04Z) - Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
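In practice such an evaluator is just a few-shot prompt: a handful of (source, summary, score) examples per quality dimension, followed by the item to rate. A sketch of that prompt construction is below; the dimension name, the examples, and the 1-5 scale are assumptions rather than the paper's exact setup.

```python
# Sketch of an in-context-learning evaluator prompt for one dimension
# of summary quality. Dimension, examples, and scale are illustrative.
def build_eval_prompt(dimension: str,
                      examples: list[tuple[str, str, int]],
                      source: str, summary: str) -> str:
    """Assemble a few-shot prompt asking an LLM to score a summary."""
    parts = [f"Rate the {dimension} of the summary on a 1-5 scale."]
    for src, summ, score in examples:
        parts.append(f"Source: {src}\nSummary: {summ}\nScore: {score}")
    parts.append(f"Source: {source}\nSummary: {summary}\nScore:")
    return "\n\n".join(parts)

prompt = build_eval_prompt(
    "coherence",
    examples=[("Storms closed the port for two days.",
               "The port shut briefly due to storms.", 5),
              ("Storms closed the port for two days.",
               "Ports are important for trade.", 2)],
    source="The council approved the budget after a long debate.",
    summary="The budget passed following extended discussion.",
)
print(prompt)  # send to any LLM and parse the returned integer
```

Varying which examples are included, and how many, is exactly the factor analysis the summary above refers to.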
arXiv Detail & Related papers (2023-06-01T23:27:49Z)