Fugu-MT Paper Translation (Abstract): CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

Paper summary: CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

arxiv url: http://arxiv.org/abs/2606.01879v1
Date: Mon, 01 Jun 2026 08:25:12 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.611454
Title: CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
Title (reference translation): CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
Authors: Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin
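Abstract summary: We introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings.
Reference score (attention score computed independently by this site): 59.09971492475217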
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.
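Abstract (reference translation): Existing research largely reduces the cultural intelligence of LLMs to a knowledge-level problem and overlooks whether models can effectively use the knowledge they have acquired in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, which enables verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, with pronounced disparities across regions. Targeted analysis uncovers several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) disentangling cultural knowledge acquisition from cultural reasoning shows that, while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by the effective use of that knowledge. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.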

Related papers

CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs [24.598338950728234]
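Large language models (LLMs) are increasingly deployed in culturally diverse environments. Existing methods focus on decontextualized correctness or forced-choice judgments. We introduce a set of benchmarks that present models with realistic situational contexts.
Paper reference translation (metadata) (2025-11-15T03:39:13Z)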
Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment [38.24188183584244]
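Reward models (RMs) are essential for aligning large language models with diverse cultures. Existing RM evaluations fall short in assessing cultural awareness because culturally relevant datasets are scarce. We propose the Cultural Awareness Reward modeling Benchmark (CARB).
Paper reference translation (metadata) (2025-09-26T02:56:06Z)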
CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
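CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for classifying cultural knowledge. Experimental results suggest that it can effectively evaluate cultural understanding.
Paper reference translation (metadata) (2025-09-19T17:47:48Z)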
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [62.9861554207279]
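Adapting cultural values in Large Language Models (LLMs) is a major challenge. Prior work has mainly used World Values Survey (WVS) data to align LLMs with diverse cultural values. We investigate WVS-based training for cultural value adaptation and find that relying on survey data alone can homogenize cultural norms and interfere with factual knowledge.
Paper reference translation (metadata) (2025-05-22T09:00:01Z)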
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs [7.802103248428407]
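We identify and test three assumptions behind current survey-based evaluation methods. We examine instability across presentation formats, inconsistency between evaluated and held-out cultural dimensions, and inconsistency under prompt steering.
Paper reference translation (metadata) (2025-03-11T17:59:53Z)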
Culture is Not Trivia: Sociocultural Theory for Cultural NLP [10.76392030245232]
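We argue that these methodological limitations are symptomatic of a theoretical gap. To fill this gap, we draw on a well-developed theory of culture from sociocultural linguistics.
Paper reference translation (metadata) (2025-02-17T17:25:11Z)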
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
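Large language models (LLMs) have demonstrated substantial commonsense understanding. This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
Paper reference translation (metadata) (2024-05-07T20:28:34Z)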
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
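This paper addresses cultural dominance in large language models (LLMs). LLMs often provide inappropriate English-culture-related answers that are irrelevant to the culture users expect when they ask in non-English languages.
Paper reference translation (metadata) (2023-10-19T05:38:23Z)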

The related papers list is generated automatically from the titles and abstracts of papers on this site.
