Fugu-MT Paper Translation (Summary): CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

Paper summary: CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

arxiv url: http://arxiv.org/abs/2510.20091v1
Date: Thu, 23 Oct 2025 00:22:10 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:17.024323
Title: CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
Title (reference translation): CreativityPrism: A Holistic Benchmark for the Creativity of Large Language Models
Authors: Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li
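Abstract summary: Creativity is often seen as a hallmark of human intelligence. There is still no holistic framework for evaluating creativity across diverse scenarios. This paper proposes CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity.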
Reference score (independently computed attention score): 64.18257552903151
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
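Abstract (reference translation): Creativity is often regarded as a hallmark of human intelligence. Large language models (LLMs) are increasingly perceived as producing creative text, yet there is still no holistic framework for evaluating their creativity across diverse scenarios. Existing evaluation methods remain fragmented and vary dramatically across domains and tasks, largely because creativity is defined and measured differently. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains (divergent thinking, creative writing, and logical reasoning), and twenty evaluation metrics that measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. The results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among the evaluation dimensions, diversity and quality metrics are strongly correlated, whereas novelty correlates much more weakly with either. These findings support the hypothesis that strong performance on one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.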

Related papers