Fugu-MT Paper Translation (Abstract): CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Paper summary: CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

arxiv url: http://arxiv.org/abs/2604.19262v1
Date: Tue, 21 Apr 2026 09:21:46 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.701696
Title: CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Title (reference translation): Benchmarking the Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Authors: Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye, Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Bo Zeng, Fan Jiang, Yuanbin Cao, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena Marinković, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Longyue Wang, Weihua Luo
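Abstract summary: CulturALL is a benchmark for evaluating the multilingual and multicultural competence of large language models. It contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, indicating room for improvement.
Reference score (independently computed attention score): 64.22418143822016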
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
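Abstract (reference translation): Large language models (LLMs) are deployed around the world, and benchmarks measuring their multilingual and multicultural abilities have surged. However, these benchmarks prioritize general language understanding or superficial cultural trivia and leave the evaluation of grounded tasks, in which models must reason within real-world, context-rich scenarios, largely unaddressed. To fill this gap, we introduce CulturALL, a comprehensive and challenging benchmark for evaluating the multilingual and multicultural competence of LLMs. CulturALL is built with a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By integrating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, which makes CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, indicating room for improvement.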

Related papers