A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models
- URL: http://arxiv.org/abs/2501.15147v1
- Date: Sat, 25 Jan 2025 09:11:15 GMT
- Title: A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models
- Authors: Zhongzhan Huang, Shanshan Zhong, Pan Zhou, Shanghua Gao, Marinka Zitnik, Liang Lin
- Abstract summary: The Oogiri game is a creativity-driven task requiring humor and associative thinking.
LoTbench is an interactive, causality-aware evaluation framework.
Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
- Score: 100.16387798660833
- Abstract: Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we find the Oogiri game, a creativity-driven task requiring humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations like ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework, to further address some intrinsic risks in standard evaluations, such as information leakage and limited interpretability. The proposed LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench better aligns with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts. https://lotbench.github.io
Related papers
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.
EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- Evaluating Creativity and Deception in Large Language Models: A Simulation Framework for Multi-Agent Balderdash [6.65572931991284]
Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments.
This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs.
arXiv Detail & Related papers (2024-11-15T18:42:48Z)
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- Benchmarking Language Model Creativity: A Case Study on Code Generation [39.546827184857754]
In this work, we introduce a framework for quantifying LLM creativity.
We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the generated creative responses.
We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions.
arXiv Detail & Related papers (2024-07-12T05:55:22Z)
- Divergent Creativity in Humans and Large Language Models [37.67363469600804]
The recent surge in the capabilities of Large Language Models has led to claims that they are approaching a level of creativity akin to human capabilities.
We leverage recent advances in creativity science to build a framework for in-depth analysis of divergent creativity in both state-of-the-art LLMs and a substantial dataset of 100,000 humans.
arXiv Detail & Related papers (2024-05-13T22:37:52Z)
- LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play [43.55248812883912]
Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions.
We propose LLM Discussion, a three-phase discussion framework that facilitates vigorous and diverging idea exchanges.
We evaluate the efficacy of the proposed framework with the Alternative Uses Test, Similarities Test, Instances Test, and Scientific Creativity Test.
arXiv Detail & Related papers (2024-05-10T10:19:14Z)
- Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation [96.0573187419543]
Chain-of-Thought (CoT) guides large language models to reason step-by-step, and can motivate their logical reasoning ability.
We explore the Leap-of-Thought (LoT) abilities within large language models (LLMs).
LoT is a non-sequential, creative paradigm involving strong associations and knowledge leaps.
arXiv Detail & Related papers (2023-12-05T02:41:57Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)